Abstract

We consider the problem of online nonparametric regression with arbitrary deterministic sequences. 
Using ideas from the chaining technique, we design an algorithm that achieves a Dudley-type 
regret bound similar to the one obtained in a non-constructive fashion by Rakhlin and Sridharan 
(2014). Our regret bound is expressed in terms of the metric entropy in the sup norm, which yields 
optimal guarantees when the metric and sequential entropies are of the same order of magnitude. In 
particular our algorithm is the first one that achieves optimal rates for online regression over Hölder
balls. In addition we show for this example how to adapt our chaining algorithm to get a reasonable 
computational efficiency with similar regret guarantees (up to a log factor). 

Keywords: online learning, nonparametric regression, chaining, individual sequences. 


1. Introduction 

We consider the setting of online nonparametric regression for arbitrary deterministic sequences,
which unfolds as follows. First, the environment chooses a sequence of observations $(y_t)_{t \ge 1}$ in $\mathbb{R}$
and a sequence of input vectors $(x_t)_{t \ge 1}$ in $\mathcal{X}$, both initially hidden from the forecaster. At each time
instant $t \in \mathbb{N}^* = \{1, 2, \ldots\}$, the environment reveals the data $x_t \in \mathcal{X}$; the forecaster then gives a
prediction $\hat{y}_t \in \mathbb{R}$; the environment in turn reveals the observation $y_t \in \mathbb{R}$; and finally, the forecaster
incurs the square loss $(y_t - \hat{y}_t)^2$.

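The term online nonparametric regression means that we are interested in forecasters whose
regret

\[ \mathrm{Reg}_T(\mathcal{F}) = \sum_{t=1}^T (y_t - \hat{y}_t)^2 - \inf_{f \in \mathcal{F}} \sum_{t=1}^T \bigl(y_t - f(x_t)\bigr)^2 \]

over standard nonparametric function classes $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ is as small as possible. In this paper we
design and study an algorithm that achieves a regret bound of the form

\[ \mathrm{Reg}_T(\mathcal{F}) \le c_1 B^2 \bigl(1 + \log \mathcal{N}_\infty(\mathcal{F}, \gamma)\bigr) + c_2 B \sqrt{T} \int_0^{\gamma} \sqrt{\log \mathcal{N}_\infty(\mathcal{F}, \varepsilon)} \, d\varepsilon , \tag{1} \]

where $\gamma \in (B/T, B)$ is a parameter of the algorithm, where $B$ is an upper bound on $\max_{1 \le t \le T} |y_t|$,
and where $\log \mathcal{N}_\infty(\mathcal{F}, \varepsilon)$ denotes the metric entropy of the function set $\mathcal{F}$ in the sup norm at scale $\varepsilon$
(cf. Section 1.4).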
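The protocol and the regret defined above can be sketched in a few lines. This is only an illustration: the `RunningMean` forecaster and the finite list of comparator functions are hypothetical stand-ins chosen for the example, not objects from the paper.

```python
class RunningMean:
    """Toy forecaster (illustrative assumption): predicts the mean of past observations."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict(self, x):
        return self.mean

    def update(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n


def cumulative_loss(xs, ys, forecaster):
    """Run the online protocol: reveal x_t, predict, reveal y_t, incur the square loss."""
    total = 0.0
    for x, y in zip(xs, ys):
        y_hat = forecaster.predict(x)   # prediction made before y_t is revealed
        total += (y - y_hat) ** 2       # square loss
        forecaster.update(x, y)
    return total


def regret(xs, ys, forecaster_loss, comparators):
    """Regret against the best function in a finite comparator list."""
    best = min(sum((y - f(x)) ** 2 for x, y in zip(xs, ys)) for f in comparators)
    return forecaster_loss - best
```

With a genuinely nonparametric class the infimum runs over infinitely many functions; the finite list here only makes the definition concrete.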

The integral on the right-hand side of (1) is very close to what is known in probability theory as
Dudley's entropy integral, a useful tool to upper bound the expectation of a centered stochastic process
with subgaussian increments (see, e.g., Talagrand 2005; Boucheron et al. 2013). In statistical
learning (with i.i.d. data), Dudley's entropy integral is key to deriving risk bounds on empirical risk
minimizers; see, e.g., Massart (2007); Rakhlin et al. (2013).


© 2015 P. Gaillard & S. Gerchinovitz. 





Very recently Rakhlin and Sridharan (2014) showed that the same type of entropy integral appears
naturally in regret bounds for online nonparametric regression. Most of their analysis
is non-constructive in the sense that their regret bounds are obtained without explicitly constructing
an algorithm. (Though they provide an abstract relaxation recipe for algorithmic purposes, we were
not able to turn it into an explicit algorithm for online regression over nonparametric classes such
as Hölder balls.)

One of our main contributions is to provide an explicit algorithm that achieves the regret bound (1). 
We note however that our regret bounds are in terms of a weaker notion of entropy, namely, metric 
entropy instead of the smaller (and optimal) sequential entropy. Fortunately, both notions are of the 
same order of magnitude for a reasonable number of examples, such as the ones outlined just below. 
We leave the question of modifying our algorithm to get sequential entropy regret bounds for future 
work. 

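The regret bound (1), which we call a Dudley-type regret bound hereafter, can be used to obtain
optimal regret bounds for several classical nonparametric function classes. Indeed, when $\mathcal{F}$ has a
metric entropy $\log \mathcal{N}_\infty(\mathcal{F}, \varepsilon) \le C_p \varepsilon^{-p}$ with¹ $p \in (0, 2)$, the bound (1) entails

\[
\begin{aligned}
\mathrm{Reg}_T(\mathcal{F}) &\le c_1 B^2 + c_1 B^2 C_p \gamma^{-p} + c_2 B \sqrt{C_p T} \int_0^{\gamma} \varepsilon^{-p/2} \, d\varepsilon \\
&= c_1 B^2 + c_1 B^2 C_p \gamma^{-p} + \frac{2 c_2 B \sqrt{C_p}}{2 - p} \sqrt{T} \, \gamma^{1 - p/2} = O\bigl(T^{p/(p+2)}\bigr)
\end{aligned}
\tag{2}
\]

for the choice of $\gamma = \Theta(T^{-1/(p+2)})$. An example is given by Hölder classes $\mathcal{F}$ with regularity
$\beta > 1/2$ (cf. Tsybakov 2009, Def. 1.2). We know from Kolmogorov and Tikhomirov (1961)
or Lorentz (1962, Theorem 2) that they satisfy $\log \mathcal{N}_\infty(\mathcal{F}, \varepsilon) = O(\varepsilon^{-1/\beta})$. Therefore, (2) entails
a regret bound $\mathrm{Reg}_T(\mathcal{F}) = O(T^{1/(2\beta+1)})$, which is in a way optimal since it corresponds to the
optimal (minimax) quadratic risk $T^{-2\beta/(2\beta+1)}$ in statistical estimation with i.i.d. data.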

1.1. Why a simple Exponentially Weighted Average forecaster is not sufficient 

A natural approach (see Vovk 2006) to compete against a nonparametric class $\mathcal{F}$ consists in running
an Exponentially Weighted Average forecaster (EWA, see Cesa-Bianchi and Lugosi 2006, p. 14) on
an $\varepsilon$-net $\mathcal{F}^{(\varepsilon)}$ of $\mathcal{F}$ of finite size $\mathcal{N}_\infty(\mathcal{F}, \varepsilon)$. This yields a regret bound of order $\varepsilon T + \log \mathcal{N}_\infty(\mathcal{F}, \varepsilon)$.
The first term $\varepsilon T$ is due to the approximation of $\mathcal{F}$ by $\mathcal{F}^{(\varepsilon)}$, while the second term is the regret
suffered by EWA on the finite class of experts $\mathcal{F}^{(\varepsilon)}$. As noted by Rakhlin and Sridharan (2014,
Remark 11), the above regret bound is suboptimal for large nonparametric classes $\mathcal{F}$. Indeed, for a
metric entropy of order $\varepsilon^{-p}$ with $p \in (0, 2)$, optimizing the above regret bound in $\varepsilon$ entails a regret
of order $\Theta(T^{p/(p+1)})$ when (1) yields the better rate $O(T^{p/(p+2)})$.
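To make the gap between the two rates concrete, the following sketch optimizes the bound $\varepsilon T + \varepsilon^{-p}$ in closed form and compares the resulting exponent with the chaining exponent. The function names are illustrative, and all constants are set to 1 as a simplifying assumption.

```python
def ewa_on_net_bound(T, p):
    """Minimum over eps of eps*T + eps**(-p) (constants set to 1 for illustration).

    Setting the derivative T - p * eps**(-p-1) to zero gives eps = (p/T)**(1/(p+1)).
    """
    eps = (p / T) ** (1.0 / (p + 1.0))
    return eps * T + eps ** (-p)


def ewa_on_net_exponent(p):
    """Exponent of T obtained with a single epsilon-net: p / (p + 1)."""
    return p / (p + 1.0)


def chaining_exponent(p):
    """Exponent of T given by the Dudley-type bound: p / (p + 2)."""
    return p / (p + 2.0)
```

For instance, with $p = 1$ the single-net approach gives a bound of order $T^{1/2}$, whereas the Dudley-type bound gives $T^{1/3}$.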

1.2. Constructing an online algorithm via the chaining technique 

Next we explain how the chaining technique from Dudley (1967) (see Appendix A for a brief reminder)
can be used to build an algorithm that satisfies a Dudley-type regret bound (1). We approximate
any function $f \in \mathcal{F}$ by a sequence of refining approximations $\pi_0(f) \in \mathcal{F}^{(0)}, \pi_1(f) \in \mathcal{F}^{(1)}, \ldots$,
such that for all $k \ge 0$, $\sup_f \|\pi_k(f) - f\|_\infty \le \gamma/2^k$ and $\mathrm{card}\, \mathcal{F}^{(k)} = \mathcal{N}_\infty(\mathcal{F}, \gamma/2^k)$, so that:

\[ \inf_{f \in \mathcal{F}} \sum_{t=1}^T \bigl(y_t - f(x_t)\bigr)^2 = \inf_{f \in \mathcal{F}} \sum_{t=1}^T \Bigl( y_t - \pi_0(f)(x_t) - \sum_{k=0}^{\infty} \underbrace{\bigl[\pi_{k+1}(f) - \pi_k(f)\bigr]}_{\|\cdot\|_\infty \le 3\gamma/2^{k+1}} (x_t) \Bigr)^2 . \]

1. When $p > 2$, we can also derive Dudley-type regret bounds that lead to a regret of $O(T^{1-1/p})$ in the same spirit as
in Rakhlin and Sridharan (2014). We omitted this case to ease the presentation.

We use the above decomposition in Algorithm 2 (Section 2.2) by performing two simultaneous 
aggregation tasks at two different scales: 

• high-scale aggregation: we run an Exponentially Weighted Average forecaster to be competitive
against every function $\pi_0(f)$ in the coarsest set $\mathcal{F}^{(0)}$;

• low-scale aggregation: we run in parallel many instances of (an extension of) the Exponentiated
Gradient (EG) algorithm so as to be competitive against the increments $\pi_{k+1}(f) - \pi_k(f)$.
The advantage of using EG is that even if the number of increments $\pi_{k+1}(f) - \pi_k(f)$
is large for small scales $\varepsilon$, the size of the gradients is very small, hence a manageable regret.

At the core of the algorithm lies the Multi-variable Exponentiated Gradient algorithm (Algorithm 1)
that makes it possible to perform low-scale aggregation at all scales $\varepsilon < \gamma$ simultaneously.

1.3. Comparison to previous works and main contributions 

Earlier uses of chaining and related techniques Several ideas that we use in this paper were
already present in the literature. Opper and Haussler (1997) and Cesa-Bianchi and Lugosi (2001)
derived Dudley-type regret bounds for the log loss using a two-scale aggregation and chaining arguments.
At small scales, their algorithm is very specific to the log loss and it is unclear how to extend
it to other exp-concave loss functions such as the square loss. Besides, they only use the chaining
technique in their analysis by reducing the regret to an expected supremum, in the same spirit as
Rakhlin et al. (2013) (square loss, batch setting) and Rakhlin and Sridharan (2014) (square loss,
online learning with individual sequences). On the contrary, Cesa-Bianchi and Lugosi (1999) built
an algorithm via chaining ideas (they use discretization sets similar to those above). However,
their algorithm is specific to linear loss functions (e.g., absolute loss with binary observations), so
that no linearization step and no high-scale aggregation are required.

Other papers on online learning with nonparametric classes Related works also include the 
paper by Vovk (2006) where—for the problem under consideration here—suboptimal regret bounds 
are derived with the Exponentially Weighted Average forecaster. Another paper that
addresses online learning over nonparametric function classes is the one by Hazan and Megiddo
(2007). They also studied the regret with respect to the set of Lipschitz functions on $[0,1]^d$, but
their loss functions are Lipschitz, hence their slower rates compared to ours.

Main contributions and outline of the paper Our contributions are threefold: we first design the
Multi-variable Exponentiated Gradient algorithm (Section 2.1) which is crucial for the linearization
step at all small scales simultaneously. We then present our main algorithm and derive a Dudley-type
regret bound as in (1) (Section 2.2). This general algorithm is computationally intractable for
nonparametric classes. In Section 3 we design an efficient algorithm in the case of Hölder classes.
To the best of our knowledge, this is the first time the chaining technique has been used in a concrete
fashion for individual sequences. Some proofs are postponed to the appendix.
1.4. Some useful definitions

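Let $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ be a set of bounded functions endowed with the sup norm $\|f\|_\infty = \sup_{x \in \mathcal{X}} |f(x)|$. For
all $\varepsilon > 0$, we call proper $\varepsilon$-net any subset $\mathcal{G} \subseteq \mathcal{F}$ such that $\forall f \in \mathcal{F}, \exists g \in \mathcal{G} : \|f - g\|_\infty \le \varepsilon$. (If
$\mathcal{G} \not\subseteq \mathcal{F}$, we call it non-proper.) The cardinality of the smallest proper $\varepsilon$-net is denoted by $\mathcal{N}_\infty(\mathcal{F}, \varepsilon)$,
and the logarithm $\log \mathcal{N}_\infty(\mathcal{F}, \varepsilon)$ is called the metric entropy of $\mathcal{F}$ at scale $\varepsilon$. When this quantity is
finite for all $\varepsilon > 0$, we say that $(\mathcal{F}, \|\cdot\|_\infty)$ is totally bounded.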
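As an illustration of these definitions, the sketch below builds a proper $\varepsilon$-net greedily for a finite family of functions. It assumes, purely for the example, that each function is represented by its tuple of values on a common finite grid, so the sup norm is a maximum over grid points. The greedy net is proper but not necessarily of minimal cardinality, so its size only upper bounds the covering number of this finite family.

```python
def sup_dist(f, g):
    """Sup-norm distance between two functions given by their values on a common grid."""
    return max(abs(a - b) for a, b in zip(f, g))


def greedy_eps_net(functions, eps):
    """Greedy proper eps-net: keep a function only if no current center is within eps."""
    net = []
    for f in functions:
        if all(sup_dist(f, center) > eps for center in net):
            net.append(f)
    return net
```

Every function that is skipped lies within `eps` of an earlier center, and every center belongs to the family itself, which is what makes the net proper.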

2. The Chaining Exponentially Weighted Average Forecaster 

In this section we design an online algorithm, the Chaining Exponentially Weighted Average forecaster,
that achieves the Dudley-type regret bound (1). In Section 2.1 below, we first define a subroutine
that will prove crucial in our analysis, and whose applicability may extend beyond this paper.

2.1. Preliminary: the Multi-variable Exponentiated Gradient Algorithm 

Let $\Delta_N = \{u \in \mathbb{R}_+^N : \sum_{i=1}^N u_i = 1\} \subseteq \mathbb{R}^N$ denote the simplex in $\mathbb{R}^N$. In this subsection we
define and study a new extension of the Exponentiated Gradient algorithm (Kivinen and Warmuth,
1997; Cesa-Bianchi, 1999). This extension is meant to minimize a sequence of multi-variable
loss functions $(u^{(1)}, \ldots, u^{(K)}) \mapsto \ell_t(u^{(1)}, \ldots, u^{(K)})$ simultaneously over all the variables
$(u^{(1)}, \ldots, u^{(K)}) \in \Delta_{N_1} \times \ldots \times \Delta_{N_K}$.

Our algorithm is described as Algorithm 1 below. We call it Multi-variable Exponentiated
Gradient. When $K = 1$, it boils down to the classical Exponentiated Gradient algorithm over the
simplex $\Delta_{N_1}$. But when $K \ge 2$, it performs $K$ simultaneous optimization updates (one for each
direction $u^{(k)}$) that lead to a global optimum by joint convexity of the loss functions $\ell_t$.


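Algorithm 1: Multi-variable Exponentiated Gradient
input : optimization domain $\Delta_{N_1} \times \ldots \times \Delta_{N_K}$ and tuning parameters $\eta^{(1)}, \ldots, \eta^{(K)} > 0$.

initialization: set $u_1^{(k)} = (1/N_k, \ldots, 1/N_k) \in \Delta_{N_k}$ for all $k = 1, \ldots, K$.

for each round t = 1, 2, ... do

• Output $(u_t^{(1)}, \ldots, u_t^{(K)}) \in \Delta_{N_1} \times \ldots \times \Delta_{N_K}$ and observe the differentiable and jointly
convex loss function $\ell_t : \Delta_{N_1} \times \ldots \times \Delta_{N_K} \to \mathbb{R}$.

• Compute the new weight vectors $(u_{t+1}^{(1)}, \ldots, u_{t+1}^{(K)}) \in \Delta_{N_1} \times \ldots \times \Delta_{N_K}$ as follows:

\[ u_{t+1,i}^{(k)} = \frac{1}{Z_{t+1}^{(k)}} \exp\Bigl( -\eta^{(k)} \sum_{s=1}^{t} \partial_{u_i^{(k)}} \ell_s\bigl(u_s^{(1)}, \ldots, u_s^{(K)}\bigr) \Bigr) , \qquad i \in \{1, \ldots, N_k\} , \]

where $\partial_{u_i^{(k)}} \ell_s$ is the partial derivative of $\ell_s$ with respect to the $i$-th component of $u^{(k)}$, and
where the normalizing factor is $Z_{t+1}^{(k)} = \sum_{i=1}^{N_k} \exp\bigl( -\eta^{(k)} \sum_{s=1}^{t} \partial_{u_i^{(k)}} \ell_s(u_s^{(1)}, \ldots, u_s^{(K)}) \bigr)$.

end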
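The Multi-variable Exponentiated Gradient algorithm satisfies the regret bound of Theorem 1
below. We first need some notation. We define the partial gradients

\[ \nabla_{u^{(k)}} \ell_t = \bigl( \partial_{u_1^{(k)}} \ell_t, \ldots, \partial_{u_{N_k}^{(k)}} \ell_t \bigr) , \]

where $\partial_{u_i^{(k)}} \ell_t$ denotes the partial derivative of $\ell_t$ with respect to the scalar variable $u_i^{(k)}$. Note that
$\nabla_{u^{(k)}} \ell_t$ is a function that maps $\Delta_{N_1} \times \ldots \times \Delta_{N_K}$ to $\mathbb{R}^{N_k}$. Next we also use the notation

\[ \|\varphi\|_\infty = \sup_{u^{(1)}, \ldots, u^{(K)}} \; \max_{1 \le i \le N_k} \bigl| \varphi_i\bigl(u^{(1)}, \ldots, u^{(K)}\bigr) \bigr| \]

for the sup norm of any vector-valued function $\varphi : \Delta_{N_1} \times \ldots \times \Delta_{N_K} \to \mathbb{R}^{N_k}$, $1 \le k \le K$.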

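Theorem 1 Assume that the loss functions $\ell_t : \Delta_{N_1} \times \ldots \times \Delta_{N_K} \to \mathbb{R}$, $t \ge 1$, are differentiable
and jointly convex. Assume also the following upper bound on their partial gradients: for all
$k \in \{1, \ldots, K\}$,

\[ \max_{1 \le t \le T} \bigl\| \nabla_{u^{(k)}} \ell_t \bigr\|_\infty \le G^{(k)} . \tag{3} \]

Then, the Multi-variable Exponentiated Gradient algorithm (Algorithm 1) tuned with the parameters
$\eta^{(k)} = \sqrt{2 \log(N_k)/T} \, / \, G^{(k)}$ has a regret bounded as follows:

\[ \sum_{t=1}^T \ell_t\bigl(u_t^{(1)}, \ldots, u_t^{(K)}\bigr) - \min_{u^{(1)}, \ldots, u^{(K)}} \sum_{t=1}^T \ell_t\bigl(u^{(1)}, \ldots, u^{(K)}\bigr) \le \sqrt{2T} \sum_{k=1}^K G^{(k)} \sqrt{\log N_k} , \]

where the minimum is taken over all $(u^{(1)}, \ldots, u^{(K)}) \in \Delta_{N_1} \times \ldots \times \Delta_{N_K}$.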

The proof of Theorem 1 is postponed to Appendix D.l. 
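A minimal sketch of the Multi-variable Exponentiated Gradient update follows, written with plain Python lists. The class name, the gradient-passing interface, and the toy loss in `demo` are assumptions made for illustration; only the exponential update itself follows the description of Algorithm 1.

```python
import math


class MultiVariableEG:
    """Sketch of Algorithm 1: one exponentiated-gradient block per simplex."""

    def __init__(self, sizes, etas):
        self.etas = list(etas)
        # cumulative partial gradients, one list per block k
        self.cum = [[0.0] * n for n in sizes]

    def weights(self):
        out = []
        for eta, c in zip(self.etas, self.cum):
            m = min(c)  # shift by the minimum so every exponent is <= 0 (stability)
            e = [math.exp(-eta * (ci - m)) for ci in c]
            z = sum(e)
            out.append([ei / z for ei in e])
        return out

    def update(self, grads):
        """grads[k][i] is the partial derivative w.r.t. the i-th coordinate of block k."""
        for c, g in zip(self.cum, grads):
            for i, gi in enumerate(g):
                c[i] += gi


def demo(T=200):
    """Toy run with K = 2 blocks and loss (y - <u1, a> - <u2, b>)**2 (illustrative)."""
    a, b, y = [0.0, 1.0], [-0.5, 0.5], 1.2
    G = [3.4, 1.7]  # gradient bounds: the residual |y - prediction| is at most 1.7
    etas = [math.sqrt(2.0 * math.log(2) / T) / g for g in G]
    eg = MultiVariableEG([2, 2], etas)
    cum_loss = 0.0
    for _ in range(T):
        u1, u2 = eg.weights()
        pred = sum(u * v for u, v in zip(u1, a)) + sum(u * v for u, v in zip(u2, b))
        cum_loss += (y - pred) ** 2
        r = -2.0 * (y - pred)
        eg.update([[r * v for v in a], [r * v for v in b]])
    bound = math.sqrt(2.0 * T) * sum(g * math.sqrt(math.log(2)) for g in G)
    return cum_loss, bound, eg.weights()
```

In this toy problem the value 1.2 is attainable by a convex combination, so the best cumulative loss in hindsight is zero and the cumulative loss returned by `demo` is itself the regret, to be compared with the bound of Theorem 1.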


2.2. The Chaining Exponentially Weighted Average Forecaster 

In this section we introduce our main algorithm: the Chaining Exponentially Weighted Average 
forecaster. A precise definition will be given in Algorithm 2 below. For the sake of clarity, we first 
describe the main ideas underlying this algorithm. 

Recall that we aim at proving a regret bound of the form (1), whose right-hand side consists of
two main terms:

\[ B^2 \log \mathcal{N}_\infty(\mathcal{F}, \gamma) \qquad \text{and} \qquad B \sqrt{T} \int_0^{\gamma} \sqrt{\log \mathcal{N}_\infty(\mathcal{F}, \varepsilon)} \, d\varepsilon . \]

Our algorithm performs aggregation at two different levels: one level (at all scales $\varepsilon \in (0, \gamma]$) to
get the entropy integral above, and another level (at scale $\gamma$) to get the other term $\log \mathcal{N}_\infty(\mathcal{F}, \gamma)$.
More precisely:

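• for all $k \in \mathbb{N}$, let $\mathcal{F}^{(k)}$ be a proper $\gamma/2^k$-net of $(\mathcal{F}, \|\cdot\|_\infty)$ of minimal cardinality² $\mathcal{N}_\infty(\mathcal{F}, \gamma/2^k)$;

• for all $k \ge 1$, set $\mathcal{G}^{(k)} = \{\pi_k(f) - \pi_{k-1}(f) : f \in \mathcal{F}\}$, where

\[ \forall f \in \mathcal{F}, \qquad \pi_k(f) \in \operatorname{argmin}_{h \in \mathcal{F}^{(k)}} \|f - h\|_\infty . \]

We denote:

• the elements of $\mathcal{F}^{(0)}$ by $f_1^{(0)}, \ldots, f_{N_0}^{(0)}$ with $N_0 = \mathcal{N}_\infty(\mathcal{F}, \gamma)$;

• the elements of $\mathcal{G}^{(k)}$ by $g_1^{(k)}, \ldots, g_{N_k}^{(k)}$; note that $N_k \le \mathcal{N}_\infty(\mathcal{F}, \gamma/2^k) \, \mathcal{N}_\infty(\mathcal{F}, \gamma/2^{k-1})$.

2. We assume that $(\mathcal{F}, \|\cdot\|_\infty)$ is totally bounded.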


With the above definitions, our algorithm can be described as follows: 

1. Low-scale aggregation: for every $j \in \{1, \ldots, N_0\}$, we use a Multi-variable Exponentiated
Gradient forecaster to mimic the best predictor in the neighborhood of $f_j^{(0)}$: we set, at each
round $t \ge 1$,

\[ \hat{f}_{t,j} = f_j^{(0)} + \sum_{k=1}^{K} \sum_{i=1}^{N_k} \hat{u}_{t,i}^{(j,k)} g_i^{(k)} , \tag{4} \]

where $K = \lceil \log_2(\gamma T / B) \rceil$, so that the lowest scale is $\gamma/2^K \le B/T$. The above weight vectors
$\hat{u}_t^{(j,k)} \in \Delta_{N_k}$ are defined in Equation (6) of Algorithm 2. They correspond exactly to the
weight vectors output by the Multi-variable Exponentiated Gradient forecaster (Algorithm 1)
applied to the loss functions $\ell_t^{(j)} : \Delta_{N_1} \times \ldots \times \Delta_{N_K} \to \mathbb{R}$ defined for all $t \ge 1$ ($j$ is fixed) by

\[ \ell_t^{(j)}\bigl(u^{(1)}, \ldots, u^{(K)}\bigr) = \Bigl( y_t - f_j^{(0)}(x_t) - \sum_{k=1}^{K} \sum_{i=1}^{N_k} u_i^{(k)} g_i^{(k)}(x_t) \Bigr)^2 . \tag{5} \]

2. High-scale aggregation: we use a standard Exponentially Weighted Average forecaster to aggregate
all the $\hat{f}_{t,j}$, $j = 1, \ldots, N_0$, as follows:

\[ \hat{f}_t = \sum_{j=1}^{N_0} \hat{w}_{t,j} \hat{f}_{t,j} , \]

where the weights $\hat{w}_{t,j}$ are defined in Equation (7) of Algorithm 2. At time $t$, our algorithm
predicts $y_t$ with $\hat{y}_t = \hat{f}_t(x_t)$.

Next we show that the Chaining Exponentially Weighted Average forecaster satisfies a Dudley-type
regret bound as in (1).

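Theorem 2 Let $B > 0$, $T \ge 1$, and $\gamma \in (B/T, B)$.

• Assume that $\max_{1 \le t \le T} |y_t| \le B$ and that $\sup_{f \in \mathcal{F}} \|f\|_\infty \le B$.

• Assume that $(\mathcal{F}, \|\cdot\|_\infty)$ is totally bounded and define $\mathcal{F}^{(0)} = \{f_1^{(0)}, \ldots, f_{N_0}^{(0)}\}$ and
$\mathcal{G}^{(k)} = \{g_1^{(k)}, \ldots, g_{N_k}^{(k)}\}$, $k = 1, \ldots, K$, as above.

Then, the Chaining Exponentially Weighted Average forecaster (Algorithm 2) tuned with the parameters
$\eta^{(0)} = 1/(50 B^2)$ and $\eta^{(k)} = \sqrt{2 \log(N_k)/T} \; 2^k / (30 B \gamma)$ for all $k = 1, \ldots, K$ satisfies:

\[ \mathrm{Reg}_T(\mathcal{F}) \le B^2 \bigl( 5 + 50 \log \mathcal{N}_\infty(\mathcal{F}, \gamma) \bigr) + 120 B \sqrt{T} \int_0^{\gamma/2} \sqrt{\log \mathcal{N}_\infty(\mathcal{F}, \varepsilon)} \, d\varepsilon . \]

As a corollary (cf. (2) in the introduction), when $\log \mathcal{N}_\infty(\mathcal{F}, \varepsilon) \le C_p \varepsilon^{-p}$ with $p \in (0, 2)$, the
Chaining Exponentially Weighted Average forecaster tuned as above and with $\gamma = \Theta(T^{-1/(p+2)})$
has a regret of $O(T^{p/(p+2)})$. This in turn yields a regret of $\mathrm{Reg}_T(\mathcal{F}) = O(T^{1/(2\beta+1)})$ when $\mathcal{F}$
is the Hölder class with regularity $\beta > 1/2$, which corresponds to the optimal (minimax) quadratic
risk $T^{-2\beta/(2\beta+1)}$ in statistical estimation with i.i.d. data. We address the particular case of Hölder
functions and the associated computational issues in Section 3 and Appendix C below.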
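Algorithm 2: Chaining Exponentially Weighted Average forecaster
input : maximal range $B > 0$, tuning parameters $\eta^{(0)}, \eta^{(1)}, \ldots, \eta^{(K)} > 0$,
high-scale functions $f_j^{(0)} : \mathcal{X} \to \mathbb{R}$ for $1 \le j \le N_0$,
low-scale functions $g_i^{(k)} : \mathcal{X} \to \mathbb{R}$ for $k \in \{1, \ldots, K\}$ and $i \in \{1, \ldots, N_k\}$.

initialization: set $\hat{w}_1 = (1/N_0, \ldots, 1/N_0) \in \Delta_{N_0}$ and $\hat{u}_1^{(j,k)} = (1/N_k, \ldots, 1/N_k) \in \Delta_{N_k}$
for all $j \in \{1, \ldots, N_0\}$ and $k \in \{1, \ldots, K\}$.

for each round t = 1, 2, ... do

• Define the aggregated functions $\hat{f}_{t,j} : \mathcal{X} \to \mathbb{R}$ for all $j \in \{1, \ldots, N_0\}$ by

\[ \hat{f}_{t,j} = f_j^{(0)} + \sum_{k=1}^{K} \sum_{i=1}^{N_k} \hat{u}_{t,i}^{(j,k)} g_i^{(k)} . \]

• Observe $x_t \in \mathcal{X}$, predict $\hat{y}_t = \sum_{j=1}^{N_0} \hat{w}_{t,j} \hat{f}_{t,j}(x_t)$, and observe $y_t \in [-B, B]$.

• Low-scale update: compute the new weight vectors $\hat{u}_{t+1}^{(j,k)} = (\hat{u}_{t+1,i}^{(j,k)})_{1 \le i \le N_k} \in \Delta_{N_k}$ for all
$j \in \{1, \ldots, N_0\}$ and $k \in \{1, \ldots, K\}$ as follows:

\[ \hat{u}_{t+1,i}^{(j,k)} = \frac{\exp\Bigl( -\eta^{(k)} \sum_{s=1}^{t} -2 \bigl(y_s - \hat{f}_{s,j}(x_s)\bigr) g_i^{(k)}(x_s) \Bigr)}{\sum_{i'=1}^{N_k} \exp\Bigl( -\eta^{(k)} \sum_{s=1}^{t} -2 \bigl(y_s - \hat{f}_{s,j}(x_s)\bigr) g_{i'}^{(k)}(x_s) \Bigr)} , \qquad i \in \{1, \ldots, N_k\} . \tag{6} \]

• High-scale update: compute the new weight vector $\hat{w}_{t+1} = (\hat{w}_{t+1,j})_{1 \le j \le N_0} \in \Delta_{N_0}$ as
follows:

\[ \hat{w}_{t+1,j} = \frac{\exp\Bigl( -\eta^{(0)} \sum_{s=1}^{t} \bigl(y_s - \hat{f}_{s,j}(x_s)\bigr)^2 \Bigr)}{\sum_{j'=1}^{N_0} \exp\Bigl( -\eta^{(0)} \sum_{s=1}^{t} \bigl(y_s - \hat{f}_{s,j'}(x_s)\bigr)^2 \Bigr)} , \qquad j \in \{1, \ldots, N_0\} . \tag{7} \]

end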
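The two updates of Algorithm 2 can be sketched as follows for finite sets of high-scale and low-scale functions given as Python callables. The class name and interface are illustrative assumptions, and the learning rates are passed in by the caller rather than fixed to the tuning of Theorem 2. Because the weights only depend on past inputs through cumulative sums, the sketch stores cumulative square losses and cumulative gradients.

```python
import math


def _exp_weights(cum, eta):
    """Exponential weights proportional to exp(-eta * cum), shifted for stability."""
    m = min(cum)
    e = [math.exp(-eta * (c - m)) for c in cum]
    z = sum(e)
    return [v / z for v in e]


class ChainingEWA:
    """Sketch of Algorithm 2 for finite lists f0 (high scale) and g[k] (low scales)."""

    def __init__(self, f0, g, eta0, etas):
        self.f0, self.g, self.eta0, self.etas = f0, g, eta0, etas
        self.cum_loss = [0.0] * len(f0)  # cumulative square loss of each f_hat_{t,j}
        # cum_grad[j][k][i]: cumulative gradient of the low-scale loss, as in (6)
        self.cum_grad = [[[0.0] * len(gk) for gk in g] for _ in f0]

    def _intermediate(self, x):
        gx = [[gi(x) for gi in gk] for gk in self.g]
        preds = []
        for j, fj in enumerate(self.f0):
            p = fj(x)
            for k, gk_x in enumerate(gx):
                u = _exp_weights(self.cum_grad[j][k], self.etas[k])
                p += sum(ui * v for ui, v in zip(u, gk_x))
            preds.append(p)
        return preds, gx

    def predict(self, x):
        preds, _ = self._intermediate(x)
        w = _exp_weights(self.cum_loss, self.eta0)  # high-scale weights, as in (7)
        return sum(wj * pj for wj, pj in zip(w, preds))

    def update(self, x, y):
        preds, gx = self._intermediate(x)
        for j, pj in enumerate(preds):
            self.cum_loss[j] += (y - pj) ** 2
            for k, gk_x in enumerate(gx):
                for i, v in enumerate(gk_x):
                    self.cum_grad[j][k][i] += -2.0 * (y - pj) * v
```

Each round costs a number of operations proportional to the total number of stored weights, which is why this direct implementation is only usable when the nets are small.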


Another corollary of Theorem 2 can be drawn in the setting of sparse high-dimensional online
linear regression, which is a particular case of a parametric class with $p \approx 0$. In the same spirit
as in Gerchinovitz (2013) and in Rakhlin and Sridharan (2014, Example 1), we consider $d$ features
$\varphi_1, \ldots, \varphi_d : \mathcal{X} \to [-B, B]$ and we define $\mathcal{F} = \{ \sum_{j=1}^d u_j \varphi_j : u \in \Delta_d, \|u\|_0 = s \}$ to be the set of
all $s$-sparse convex combinations of the features ($\|u\|_0$ denotes the number of non-zero coefficients
of $u$). Then, using Theorem 1 of Gao et al. (2013) with $(M, q, p, r) = (s, 1, +\infty, 2)$ we can see that
$\log \mathcal{N}_\infty(\mathcal{F}, \varepsilon) \lesssim \log \binom{d}{s} + s \log\bigl(1 + 1/(\varepsilon \sqrt{s})\bigr)$. Plugging this bound in Theorem 2 with $\gamma = 1/\sqrt{T}$
yields a regret bound of order $s \log(1 + dT/s)$. Thus, Theorem 2 also yields (quasi) optimal rates
for sparse high-dimensional online linear regression.
Finally, for much larger function classes with $p > 2$, we could derive a slightly modified Dudley-type
regret bound of $O(T^{1-1/p})$ with a slightly modified algorithm, in the same spirit as in Rakhlin
and Sridharan (2014). We omit this case for the sake of conciseness.

Remark 3 In Theorem 2 above, we assumed that the observations $y_t$ and the predictions $f(x_t)$
are all bounded by $B$, and that $B$ is known in advance by the forecaster. We can actually remove
this requirement by using adaptive techniques of Gerchinovitz and Yu (2014), namely, adaptive
clipping of the intermediate predictions $\hat{f}_{t,j}(x_t)$ and adaptive Lipschitzification of the square loss
functions $\ell_t^{(j)}$. This modification enables us to derive the same regret bound (up to multiplicative
constant factors) with $B = \max_t |y_t|$, but without knowing $B$ in advance, and without requiring
that $\sup_{f \in \mathcal{F}} \|f\|_\infty$ also be upper bounded by $B$. Of course these adaptation techniques also make
it possible to tune all parameters without knowing $T$ in advance.

Remark 4 Even in the case when $B$ is known by the forecaster, the clipping and Lipschitzification
techniques of Gerchinovitz and Yu (2014) can be useful to get smaller constants in the regret bound.
We could indeed replace the constants 50 and 120 with 8 and 48 respectively. (Moreover, the regret
bound would also hold true for $\gamma > B$.) We chose however not to use these refinements in order to
simplify the analysis.

Remark 5 We assumed that the performance of a forecast $\hat{y}_t$ at round $t \ge 1$ is measured through
the square loss $\ell_t(\hat{y}_t) = (y_t - \hat{y}_t)^2$, which is $1/(50 B^2)$-exp-concave on $[-4B, 4B]$. The analysis
can easily be extended to all $\eta$-exp-concave (and thus convex) loss functions $\ell_t$ on $[-4B, 4B]$ that
also satisfy a self-bounding property of the form $|d\ell_t / d\hat{y}_t| \le C \ell_t^{r}$ (an example is given by $\ell_t(\hat{y}_t) =
|y_t - \hat{y}_t|^r$ with $r \ge 2$). The regret bound of Theorem 2 remains unchanged up to a multiplicative
factor depending on $B$, $C$, and $r$. If the loss functions $\ell_t$ are only convex (e.g., the absolute loss
$\ell_t(\hat{y}_t) = |y_t - \hat{y}_t|$ or the pinball loss to perform quantile regression), the high-scale aggregation step
is more costly: the term of order $\log \mathcal{N}_\infty(\mathcal{F}, \gamma)$ is replaced with a term of order $\sqrt{T \log \mathcal{N}_\infty(\mathcal{F}, \gamma)}$.

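Proof (of Theorem 2) We split our proof into two parts, one for each aggregation level.

Part 1: low-scale aggregation.

In this part, we fix $j \in \{1, \ldots, N_0\}$. As explained right before (5), the tuple of weight vectors
$(\hat{u}_t^{(j,1)}, \ldots, \hat{u}_t^{(j,K)}) \in \Delta_{N_1} \times \ldots \times \Delta_{N_K}$ computed at all rounds $t \ge 1$ corresponds exactly to the output
of the Multi-variable Exponentiated Gradient forecaster when applied to the loss functions $\ell_t^{(j)}$, $t \ge 1$,
defined in (5). We can therefore apply Theorem 1 after checking its assumptions:

• the loss functions $\ell_t^{(j)}$ are indeed differentiable and jointly convex;

• the norms $\|\nabla_{\hat{u}^{(j,k)}} \ell_t^{(j)}\|_\infty$ of the partial gradients are bounded by $30 B \gamma / 2^k$ for all $1 \le k \le K$.
Indeed, the $i$-th coordinate of $\nabla_{\hat{u}^{(j,k)}} \ell_t^{(j)}$ is equal to

\[ \partial_{\hat{u}_i^{(j,k)}} \ell_t^{(j)} = -2 \bigl(y_t - \hat{f}_{t,j}(x_t)\bigr) g_i^{(k)}(x_t) , \tag{8} \]

which can be upper bounded (in absolute value) by $2 \times 5B \times 3\gamma/2^k$. To see why this is true,
first note that $|g_i^{(k)}(x_t)| \le \|g_i^{(k)}\|_\infty = \|\pi_k(f) - \pi_{k-1}(f)\|_\infty$ for some $f \in \mathcal{F}$ (by definition
of $\mathcal{G}^{(k)}$), so that, by the triangle inequality and by definition of $\pi_k(f)$ and $\pi_{k-1}(f)$,

\[ \bigl\| g_i^{(k)} \bigr\|_\infty \le \|\pi_k(f) - f\|_\infty + \|\pi_{k-1}(f) - f\|_\infty \le \frac{\gamma}{2^k} + \frac{\gamma}{2^{k-1}} = \frac{3\gamma}{2^k} . \tag{9} \]

Second, note that $|y_t - \hat{f}_{t,j}(x_t)| \le |y_t| + |\hat{f}_{t,j}(x_t)| \le 5B$. Indeed, we have $|y_t| \le B$ by
assumption and, by definition of $\hat{f}_{t,j}$ in (4), we also have

\[ \bigl| \hat{f}_{t,j}(x_t) \bigr| \le \bigl| f_j^{(0)}(x_t) \bigr| + \sum_{k=1}^{K} \sum_{i=1}^{N_k} \hat{u}_{t,i}^{(j,k)} \bigl| g_i^{(k)}(x_t) \bigr| \le B + \sum_{k=1}^{K} \frac{3\gamma}{2^k} \le B + 3\gamma \le 4B , \tag{10} \]

where we used the inequalities $|f_j^{(0)}(x_t)| \le \sup_{f \in \mathcal{F}} \|f\|_\infty \le B$ (by assumption), and where
we combined (9) with the fact that $\sum_{i=1}^{N_k} \hat{u}_{t,i}^{(j,k)} = 1$. The last inequality above is obtained
from the assumption $\gamma \le B$. Substituting the above various upper bounds in (8) entails that
$\|\nabla_{\hat{u}^{(j,k)}} \ell_t^{(j)}\|_\infty \le 30 B \gamma / 2^k$ for all $1 \le k \le K$, as claimed earlier.

We are now in a position to apply Theorem 1. It yields:

\[ \sum_{t=1}^T \bigl(y_t - \hat{f}_{t,j}(x_t)\bigr)^2 \le \inf_{g_1, \ldots, g_K} \sum_{t=1}^T \Bigl(y_t - \bigl(f_j^{(0)} + g_1 + \ldots + g_K\bigr)(x_t)\Bigr)^2 + 30 B \gamma \sqrt{2T} \sum_{k=1}^{K} 2^{-k} \sqrt{\log N_k} , \tag{11} \]

where the infimum is over all functions $g_1 \in \mathcal{G}^{(1)}, \ldots, g_K \in \mathcal{G}^{(K)}$ (we used the regret bound of
Theorem 1 with Dirac weight vectors $u^{(k)} = \delta_i$, $i = 1, \ldots, N_k$).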

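Now, using the fact that $N_k \le \mathcal{N}_\infty(\mathcal{F}, \gamma/2^k) \, \mathcal{N}_\infty(\mathcal{F}, \gamma/2^{k-1}) \le \bigl(\mathcal{N}_\infty(\mathcal{F}, \gamma/2^k)\bigr)^2$, we get

\[
\begin{aligned}
\gamma \sum_{k=1}^{K} 2^{-k} \sqrt{\log N_k} &\le 2\sqrt{2} \sum_{k=1}^{K} \Bigl( \frac{\gamma}{2^k} - \frac{\gamma}{2^{k+1}} \Bigr) \sqrt{\log \mathcal{N}_\infty(\mathcal{F}, \gamma/2^k)} \\
&\le 2\sqrt{2} \sum_{k=1}^{K} \int_{\gamma/2^{k+1}}^{\gamma/2^k} \sqrt{\log \mathcal{N}_\infty(\mathcal{F}, \varepsilon)} \, d\varepsilon \le 2\sqrt{2} \int_0^{\gamma/2} \sqrt{\log \mathcal{N}_\infty(\mathcal{F}, \varepsilon)} \, d\varepsilon ,
\end{aligned}
\]

where the inequality before last follows by monotonicity of $\varepsilon \mapsto \mathcal{N}_\infty(\mathcal{F}, \varepsilon)$ on every interval
$[\gamma/2^{k+1}, \gamma/2^k]$. Finally, substituting the above integral in (11) yields

\[ \sum_{t=1}^T \bigl(y_t - \hat{f}_{t,j}(x_t)\bigr)^2 \le \inf_{g_1, \ldots, g_K} \sum_{t=1}^T \Bigl(y_t - \bigl(f_j^{(0)} + g_1 + \ldots + g_K\bigr)(x_t)\Bigr)^2 + 120 B \sqrt{T} \int_0^{\gamma/2} \sqrt{\log \mathcal{N}_\infty(\mathcal{F}, \varepsilon)} \, d\varepsilon . \tag{12} \]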
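The comparison between the dyadic sum and the entropy integral used in Part 1 can be checked numerically. The sketch below assumes, for illustration only, a polynomial entropy $\log \mathcal{N}_\infty(\mathcal{F}, \varepsilon) = C \varepsilon^{-p}$, for which the integral has a closed form; the function names are hypothetical.

```python
import math


def dyadic_sum(gamma, K, entropy):
    """Sum over k = 1..K of (gamma/2^k - gamma/2^(k+1)) * sqrt(entropy(gamma/2^k))."""
    return sum(
        (gamma / 2 ** k - gamma / 2 ** (k + 1)) * math.sqrt(entropy(gamma / 2 ** k))
        for k in range(1, K + 1)
    )


def entropy_integral(gamma, K, p, C):
    """Closed form of the integral of sqrt(C * eps^-p) over [gamma/2^(K+1), gamma/2]."""
    a, b = gamma / 2 ** (K + 1), gamma / 2
    q = 1.0 - p / 2.0  # positive since p < 2
    return math.sqrt(C) * (b ** q - a ** q) / q
```

Each term of the sum is the length of $[\gamma/2^{k+1}, \gamma/2^k]$ times the value of a nonincreasing integrand at the right endpoint, which is why the sum never exceeds the integral.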

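Part 2: high-scale aggregation.

The prediction $\hat{y}_t = \hat{f}_t(x_t) = \sum_{j=1}^{N_0} \hat{w}_{t,j} \hat{f}_{t,j}(x_t)$ at time $t$ is a convex combination of the intermediate
predictions $\hat{f}_{t,j}(x_t)$, where the weights $\hat{w}_{t,j}$ correspond exactly to those of the standard
Exponentially Weighted Average forecaster tuned with $\eta^{(0)} = 1/(50 B^2) = 1/(2 (5B)^2)$. Since
the intermediate predictions $\hat{f}_{t,j}(x_t)$ lie in $[-4B, 4B]$ (by (10) above), and since the square loss
$z \mapsto (y_t - z)^2$ is $\eta^{(0)}$-exp-concave on $[-4B, 4B]$ for any $y_t \in [-B, B]$, we get from Proposition 3.1
and Page 46 of Cesa-Bianchi and Lugosi (2006) that

\[
\begin{aligned}
\sum_{t=1}^T (y_t - \hat{y}_t)^2 &\le \min_{1 \le j \le N_0} \sum_{t=1}^T \bigl(y_t - \hat{f}_{t,j}(x_t)\bigr)^2 + \frac{\log N_0}{\eta^{(0)}} \\
&\le \inf_{f_0, g_1, \ldots, g_K} \sum_{t=1}^T \Bigl(y_t - \bigl(f_0 + g_1 + \ldots + g_K\bigr)(x_t)\Bigr)^2 \\
&\qquad + 120 B \sqrt{T} \int_0^{\gamma/2} \sqrt{\log \mathcal{N}_\infty(\mathcal{F}, \varepsilon)} \, d\varepsilon + 50 B^2 \log \mathcal{N}_\infty(\mathcal{F}, \gamma) ,
\end{aligned}
\tag{13}
\]

where the infimum is over all functions $f_0 \in \mathcal{F}^{(0)}$, $g_1 \in \mathcal{G}^{(1)}, \ldots, g_K \in \mathcal{G}^{(K)}$. The last inequality
above was a consequence of (12). Next we apply the chaining idea: by definition of the function
sets $\mathcal{F}^{(0)} \supseteq \{\pi_0(f) : f \in \mathcal{F}\}$ and $\mathcal{G}^{(k)} = \{\pi_k(f) - \pi_{k-1}(f) : f \in \mathcal{F}\}$, we have

\[
\begin{aligned}
\inf_{f_0, g_1, \ldots, g_K} & \sum_{t=1}^T \Bigl(y_t - \bigl(f_0 + g_1 + \ldots + g_K\bigr)(x_t)\Bigr)^2 \\
&\le \inf_{f \in \mathcal{F}} \sum_{t=1}^T \Bigl(y_t - \bigl(\pi_0(f) + [\pi_1(f) - \pi_0(f)] + \ldots + [\pi_K(f) - \pi_{K-1}(f)]\bigr)(x_t)\Bigr)^2 \\
&= \inf_{f \in \mathcal{F}} \sum_{t=1}^T \bigl(y_t - \pi_K(f)(x_t)\bigr)^2 \\
&\le \inf_{f \in \mathcal{F}} \sum_{t=1}^T \Bigl[ \bigl(y_t - f(x_t)\bigr)^2 + 2 \cdot 2B \, \|\pi_K(f) - f\|_\infty + \|\pi_K(f) - f\|_\infty^2 \Bigr] && (14) \\
&\le \inf_{f \in \mathcal{F}} \sum_{t=1}^T \bigl(y_t - f(x_t)\bigr)^2 + 4 B^2 + \frac{B^2}{T} , && (15)
\end{aligned}
\]

where (14) is obtained by expanding the square $(y_t - \pi_K(f)(x_t))^2 = (y_t - f(x_t) + f(x_t) -
\pi_K(f)(x_t))^2$, and where (15) follows from the fact that $\|\pi_K(f) - f\|_\infty \le \gamma/2^K \le B/T$ by definition
of $\pi_K(f)$ and $K = \lceil \log_2(\gamma T / B) \rceil$. Combining (13) and (15) concludes the proof. ■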


3. An efficient chaining algorithm for Hölder classes

The Chaining Exponentially Weighted Average forecaster of the previous section is quite natural
since it explicitly exploits the $\varepsilon$-nets that appear in the Dudley-type regret bound (1). However
its time and space computational complexities are prohibitively large (exponential in $T$) since it
is necessary to update exponentially many weights at every round $t$. It actually turns out that,
fortunately, most standard function classes have a sufficiently nice structure. This enables us to
adapt the previous chaining technique on (quasi-optimal) $\varepsilon$-nets that are much easier to exploit from
an algorithmic viewpoint. We describe below the particular case of Lipschitz classes; the more
general case of Hölder classes is postponed to Appendix C.

In all the sequel, $\mathcal{F}$ denotes the set of functions from $[0,1]$ to $[-B, B]$ that are 1-Lipschitz.
Recall from the introduction that $\log \mathcal{N}_\infty(\mathcal{F}, \varepsilon) = O(\varepsilon^{-1})$, so that, by Theorem 2 and (2), the
Chaining Exponentially Weighted Average forecaster guarantees a regret of $O(T^{1/3})$. We explain
below how to modify this algorithm with $\varepsilon$-nets of $(\mathcal{F}, \|\cdot\|_\infty)$ that are easier to manage from a
computational viewpoint. This leads to a quasi-optimal regret of $O(T^{1/3} \log T)$; see Theorem 6.
3.1. Constructing computationally-manageable $\varepsilon$-nets via a dyadic discretization


Let $\gamma \in (B/T, B)$ be a fixed real number that will play the same role as in Theorem 2. Using the fact
that all functions in $\mathcal{F}$ are 1-Lipschitz, we can approximate $\mathcal{F}$ with piecewise-constant functions as
follows. We partition the x-axis $[0,1]$ into $1/\gamma$ subintervals $I_a = [(a-1)\gamma, a\gamma)$, $a = 1, \ldots, 1/\gamma$
(the last interval is closed at $x = 1$). We also use a discretization of length $\gamma$ on the y-axis $[-B, B]$,
by considering values of the form $c^{(0)} = -B + j\gamma$, $j = 0, \ldots, 2B/\gamma$. (For the sake of simplicity,
we assume that both $1/\gamma$ and $2B/\gamma$ are integers.) We then define the set $\mathcal{F}^{(0)}$ of piecewise-constant
functions $f^{(0)} : [0,1] \to [-B, B]$ of the form

\[ f^{(0)}(x) = \sum_{a=1}^{1/\gamma} c_a^{(0)} \mathbb{1}_{x \in I_a} , \qquad c_1^{(0)}, \ldots, c_{1/\gamma}^{(0)} \in \mathcal{C}^{(0)} := \Bigl\{ -B + j\gamma : j = 0, \ldots, \frac{2B}{\gamma} \Bigr\} . \tag{16} \]

Using the fact that all functions in $\mathcal{F}$ are 1-Lipschitz, it is quite straightforward to see that $\mathcal{F}^{(0)}$
is a $\gamma$-net³ of $(\mathcal{F}, \|\cdot\|_\infty)$. (To see why this is true, we can choose $c_a^{(0)} \in \operatorname{argmin}_{c \in \mathcal{C}^{(0)}} |f(x_a) - c|$,
where $x_a$ is the center of the subinterval $I_a$. See Lemma 13 in the appendix for further details.)
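The construction of the level-0 approximation can be sketched as follows: on each subinterval, pick the grid value closest to the function at the center. The function names and the test function are illustrative; the sketch assumes, as in the text, that $1/\gamma$ and $2B/\gamma$ are integers.

```python
import math


def level0_approx(f, B, gamma):
    """Piecewise-constant approximation of f on [0, 1] with values in {-B + j*gamma}."""
    n_cells = round(1.0 / gamma)      # number of subintervals I_a
    n_vals = round(2.0 * B / gamma)   # grid values are indexed by j = 0..n_vals
    consts = []
    for a in range(n_cells):
        xa = (a + 0.5) * gamma        # center of the subinterval
        j = min(range(n_vals + 1), key=lambda j: abs(f(xa) - (-B + j * gamma)))
        consts.append(-B + j * gamma)

    def f0(x):
        a = min(int(x / gamma), n_cells - 1)  # the last interval is closed at x = 1
        return consts[a]

    return f0


def sup_error(f, g, n_points=1001):
    """Maximum of |f - g| over a uniform grid of [0, 1] (a proxy for the sup norm)."""
    return max(abs(f(i / (n_points - 1)) - g(i / (n_points - 1))) for i in range(n_points))
```

For a 1-Lipschitz function, the error at any point splits into at most $\gamma/2$ from moving to the center of the subinterval plus at most $\gamma/2$ from rounding to the grid, which gives the $\gamma$-net property.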


Refinement via a dyadic discretization Next we construct $\gamma/2^m$-nets that are refinements of the
$\gamma$-net $\mathcal{F}^{(0)}$. We need to define a dyadic discretization for each subinterval $I_a$ as follows: for any
level $m \ge 1$, we partition $I_a$ into $2^m$ subintervals $I_a^{(m,n)}$, $n = 1, \ldots, 2^m$, of equal size $\gamma/2^m$.
Note that the subintervals $I_a^{(m,n)}$, $a = 1, \ldots, 1/\gamma$ and $n = 1, \ldots, 2^m$, form a partition of $[0,1]$.
We call it the level-$m$ partition. We enrich the set $\mathcal{F}^{(0)}$ by looking at all the functions of the
form $f^{(0)} + \sum_{m=1}^{M} g^{(m)}$, where $f^{(0)} \in \mathcal{F}^{(0)}$ and where every function $g^{(m)}$ is piecewise-constant
on the level-$m$ partition, with values $c_a^{(m,n)} \in [-\gamma/2^{m-1}, \gamma/2^{m-1}]$ that are small when $m$ is
large. In other words, we define the level-$M$ approximation set $\mathcal{F}^{(M)}$ as the set of all functions
$f_c : [0,1] \to \mathbb{R}$ of the form

\[ f_c(x) = \underbrace{\sum_{a=1}^{1/\gamma} c_a^{(0)} \mathbb{1}_{x \in I_a}}_{f^{(0)}(x)} + \sum_{m=1}^{M} \underbrace{\sum_{a=1}^{1/\gamma} \sum_{n=1}^{2^m} c_a^{(m,n)} \mathbb{1}_{x \in I_a^{(m,n)}}}_{g^{(m)}(x)} , \tag{17} \]

where $c_a^{(0)} \in \mathcal{C}^{(0)}$ and $c_a^{(m,n)} \in [-\gamma/2^{m-1}, \gamma/2^{m-1}]$. An example of function $f_c = f^{(0)} +
\sum_{m=1}^{M} g^{(m)}$ is plotted on Figure 1 in the case when $M = 2$ (the plot is restricted to the interval $I_a$).

Since all functions in $\mathcal{F}$ are 1-Lipschitz, the set $\mathcal{F}^{(M)}$ of all functions $f_c$ is a $\gamma/2^{M+1}$-net of
$(\mathcal{F}, \|\cdot\|_\infty)$; see Lemma 13 in the appendix for a proof. Note that $\mathcal{F}^{(M)}$ is infinite (the $c_a^{(m,n)}$ are
continuously valued); fortunately this is not a problem since the $c_a^{(m,n)}$ can be rewritten as convex
combinations $u_1 (-\gamma/2^{m-1}) + u_2 (\gamma/2^{m-1})$ of only two values; cf. (18) below.
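One way to realize the dyadic refinement is sketched below: at each level, the coefficient on a cell is the residual of the previous approximation at the center of that cell. This particular choice of coefficients is an assumption made for illustration (the construction in the appendix of the paper may differ), but it keeps each level-$m$ coefficient within $[-\gamma/2^{m-1}, \gamma/2^{m-1}]$ for 1-Lipschitz functions. As before, $1/\gamma$ and $2B/\gamma$ are assumed to be integers.

```python
import math


def dyadic_approx(f, B, gamma, M):
    """Level-M approximation f^(0) + g^(1) + ... + g^(M) of a 1-Lipschitz f on [0, 1]."""
    n_cells = round(1.0 / gamma)
    n_vals = round(2.0 * B / gamma)
    c0 = []
    for a in range(n_cells):
        xa = (a + 0.5) * gamma
        j = min(range(n_vals + 1), key=lambda j: abs(f(xa) - (-B + j * gamma)))
        c0.append(-B + j * gamma)

    levels = []  # levels[m-1][idx]: coefficient on the idx-th cell of the level-m partition

    def evaluate(x, upto):
        val = c0[min(int(x / gamma), n_cells - 1)]
        for m in range(1, upto + 1):
            width = gamma / 2 ** m
            idx = min(int(x / width), n_cells * 2 ** m - 1)
            val += levels[m - 1][idx]
        return val

    for m in range(1, M + 1):
        width = gamma / 2 ** m
        coeffs = []
        for idx in range(n_cells * 2 ** m):
            center = (idx + 0.5) * width
            # residual of the level-(m-1) approximation at the center of the cell
            coeffs.append(f(center) - evaluate(center, m - 1))
        levels.append(coeffs)

    return (lambda x: evaluate(x, M)), levels
```

With this choice the level-$m$ approximation matches the function exactly at the center of every level-$m$ cell, so the remaining error is at most half the cell width, that is, $\gamma/2^{m+1}$.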


3.2. A chaining algorithm using this dyadic discretization 

Next we design an algorithm which, as in Section 2.2, is able to be competitive against any function
$f_c = f^{(0)} + \sum_{m=1}^{M} g^{(m)}$. However, instead of maintaining exponentially many weights as in
Algorithm 2, we use the dyadic discretization in a crucial way. More precisely:

We run $1/\gamma$ instances of the same algorithm $\mathcal{A}$ in parallel; the $a$-th instance $\mathcal{A}_a$, $a = 1, \ldots, 1/\gamma$,
corresponds to the subinterval $I_a$ and it is updated only at rounds $t$ such that $x_t \in I_a$.

3. This $\gamma$-net is not proper since $\mathcal{F}^{(0)} \not\subseteq \mathcal{F}$.

Figure 1 (plot not reproduced): an example of function $f_c$ for $M = 2$, restricted to $I_a$. This function corresponds to the dotted line (level 2).


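Next we focus on subalgorithm $\mathcal{A}_a$. As in Algorithm 2, we use a combination of the EWA and
the Multi-variable EG forecasters to perform high-scale and low-scale aggregation simultaneously:

Low-scale aggregation: we run $2B/\gamma + 1$ instances $\mathcal{B}_{a,j}$, $j = 0, \ldots, 2B/\gamma$, of the Adaptive Multi-variable
Exponentiated Gradient algorithm (Algorithm 3 in the appendix) simultaneously. Each
instance $\mathcal{B}_{a,j}$ corresponds to a particular constant $c_a^{(0)} = -B + j\gamma \in \mathcal{C}^{(0)}$ and is run (similarly
to (5)) with the loss function $\ell_t$ defined for all weight vectors $u^{(m,n)} \in \Delta_2$ by

\[ \ell_t\bigl(u^{(m,n)}, \, m = 1, \ldots, M, \, n = 1, \ldots, 2^m\bigr) = \Bigl( y_t - (-B + j\gamma) - \sum_{m=1}^{M} \sum_{n=1}^{2^m} \Bigl( -u_1^{(m,n)} \frac{\gamma}{2^{m-1}} + u_2^{(m,n)} \frac{\gamma}{2^{m-1}} \Bigr) \mathbb{1}_{x_t \in I_a^{(m,n)}} \Bigr)^2 . \tag{18} \]

The above convex combinations $u_1^{(m,n)} (-\gamma/2^{m-1}) + u_2^{(m,n)} (\gamma/2^{m-1})$ ensure that subalgorithm $\mathcal{B}_{a,j}$
is competitive against the best constants $c_a^{(m,n)} \in [-\gamma/2^{m-1}, \gamma/2^{m-1}]$ for all $m$ and $n$.

The weight vectors output by subalgorithm $\mathcal{B}_{a,j}$ (when $x_t \in I_a$) are denoted by $\hat{u}_t^{(j,m,n)}$ and we set

\[ \hat{f}_{t,a,j}(x) = -B + j\gamma + \sum_{m=1}^{M} \sum_{n=1}^{2^m} \Bigl( -\hat{u}_{t,1}^{(j,m,n)} \frac{\gamma}{2^{m-1}} + \hat{u}_{t,2}^{(j,m,n)} \frac{\gamma}{2^{m-1}} \Bigr) \mathbb{1}_{x \in I_a^{(m,n)}} \qquad \text{for all } j = 0, \ldots, 2B/\gamma . \]

High-scale aggregation: we aggregate the $2B/\gamma + 1$ forecasters above with a standard Exponentially
Weighted Average forecaster (tuned, e.g., with the parameter $\eta = 1/(2 (4B)^2) = 1/(32 B^2)$):

\[ \hat{f}_{t,a} = \sum_{j=0}^{2B/\gamma} \hat{w}_{t,a,j} \hat{f}_{t,a,j} . \tag{19} \]

Putting all things together: at every time $t \ge 1$, we make the prediction $\hat{f}_t(x_t) = \sum_{a=1}^{1/\gamma} \hat{f}_{t,a}(x_t) \mathbb{1}_{x_t \in I_a}$.

We call this algorithm the Dyadic Chaining Algorithm.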


Theorem 6 Let $B > 0$, $T \ge 2$, and $\mathcal{F}$ be the set of all 1-Lipschitz functions from $[0,1]$ to $[-B, B]$.
Assume that $\max_{1 \le t \le T} |y_t| \le B$. Then, the Dyadic Chaining Algorithm defined above and tuned
with the parameters $\gamma = B T^{-1/3}$ and $M = \lceil \log_2(\gamma T / B) \rceil$ satisfies, for some absolute constant
$c > 0$,

\[ \mathrm{Reg}_T(\mathcal{F}) \le c \max\{B, B^2\} \, T^{1/3} \log T . \]

The proof is postponed to the appendix. Note that the Dyadic Chaining Algorithm is computationally
tractable: at every round $t$, the point $x_t$ only falls into one subinterval $I_a^{(m,n)}$ for each level
$m = 1, \ldots, M$, so that we only need to update $O(2B/\gamma \times M) = O(T^{1/3} \log T)$ weights at every
round. For the same reason, the overall space complexity is $O(T \times 2B/\gamma \times M) = O(T^{4/3} \log T)$.
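The per-round cost can be made concrete with a small calculation. The helper below is hypothetical and simply counts, for given $T$, $B$ and $\gamma$, the number of levels $M$, the number of low-scale instances per subinterval, and the number of weight vectors touched in one round; it assumes $2B/\gamma$ is an integer.

```python
import math


def dyadic_chaining_cost(T, B, gamma):
    """Return (M, instances per subinterval, weight vectors updated per round)."""
    M = math.ceil(math.log2(gamma * T / B))   # number of dyadic levels
    n_instances = round(2.0 * B / gamma) + 1  # one instance per constant -B + j*gamma
    per_round = n_instances * M               # one 2-dimensional weight vector per level
    return M, n_instances, per_round
```

For example, with $T = 1000$, $B = 1$ and $\gamma = 0.1$ (which is of order $B T^{-1/3}$), this gives $M = 7$ levels and 21 instances, hence 147 weight vectors per round.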

Remark 7 The algorithm can be extended to the case of Lipschitz functions on $[0,1]^d$. It leads to
an optimal regret of order up to a log factor. Besides, the computational complexity is 

still tractable. Indeed, at each round t, the point xt only falls into one cell of the partition. Hence, 
the time complexity is polynomial in T with an exponent independent of d. This also applies to the 
space complexity if we use sparsity-tailored data types. The extension to Hölder function classes on
$[0,1]^d$ is however more difficult and we leave it for future work.
Appendix A. The chaining technique: a brief reminder

The idea of chaining was introduced by Dudley (1967). It provides a general method to bound the 
supremum of stochastic processes. For the convenience of the reader, we recall the main ideas un¬ 
derlying this technique; see, e.g., Boucheron et al. (2013) for further details. We consider a centered 
stochastic process (Xy)ygjr indexed by some finite metric space, say, {F, IHloo), with subgaussian 
increments, which means that ^ ^v\'^\\f — gW^ for all A > 0 and all f,g € F. 

The goal is to bound the quantity E [ supjg Xj] = E [ sup f^r{Xf — Xf^)] for any /o G F. 

Lemma 8 (Boucheron et al. 2013) Let $Z_1, \dots, Z_K$ be subgaussian random variables with parameter $v > 0$ (i.e., $\log \mathbb{E} \exp(\lambda Z_i) \le \lambda^2 v/2$ for all $\lambda \in \mathbb{R}$), then $\mathbb{E} \max_{i=1,\dots,K} Z_i \le \sqrt{2 v \log K}$.


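Lemma 8 entails $\mathbb{E}\bigl[\sup_{f \in \mathcal{F}} (X_f - X_{f_0})\bigr] \le B \sqrt{2 v \log(\mathrm{card}\, \mathcal{F})}$, where $B = \sup_{f \in \mathcal{F}} \|f - f_0\|_\infty$. However, this bound is too crude since $X_f$ and $X_g$ are very correlated when $f$ and $g$ are very close. The chaining technique takes this remark into account by approximating the maximal value $\sup_f X_f$ by maxima over successive refining discretizations $\mathcal{F}^{(0)}, \dots, \mathcal{F}^{(K)}$ of $\mathcal{F}$. More formally, for any $f \in \mathcal{F}$, we consider a sequence of approximations $\pi_0(f) = f_0 \in \mathcal{F}^{(0)}$, $\pi_1(f) \in \mathcal{F}^{(1)}$, ..., $\pi_K(f) = f \in \mathcal{F}^{(K)}$, where $\|f - \pi_k(f)\|_\infty \le B/2^k$ and $\mathrm{card}\, \mathcal{F}^{(k)} = \mathcal{N}_\infty(\mathcal{F}, B/2^k)$, so that:

$$\mathbb{E}\Bigl[\sup_{f \in \mathcal{F}} (X_f - X_{f_0})\Bigr] = \mathbb{E}\Bigl[\sup_{f \in \mathcal{F}} \sum_{k=0}^{K-1} \bigl(X_{\pi_{k+1}(f)} - X_{\pi_k(f)}\bigr)\Bigr] \le \sum_{k=0}^{K-1} \mathbb{E}\Bigl[\sup_{f \in \mathcal{F}} \bigl(X_{\pi_{k+1}(f)} - X_{\pi_k(f)}\bigr)\Bigr].$$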



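We apply Lemma 8 for each $k \in \{0, \dots, K-1\}$: since $\|\pi_{k+1}(f) - \pi_k(f)\|_\infty \le 3B/2^{k+1}$ (by the triangle inequality) and $\mathrm{card}\{\pi_{k+1}(f) - \pi_k(f),\ f \in \mathcal{F}\} \le \mathcal{N}_\infty(\mathcal{F}, B/2^{k+1})^2$, we get the well-known Dudley entropy bound (note that $\varepsilon \mapsto \mathcal{N}_\infty(\mathcal{F}, \varepsilon)$ is nonincreasing):

$$\mathbb{E}\Bigl[\sup_{f \in \mathcal{F}} (X_f - X_{f_0})\Bigr] \le 6 \sum_{k=0}^{K-1} B\, 2^{-(k+1)} \sqrt{v \log \mathcal{N}_\infty(\mathcal{F}, B/2^{k+1})} \le 12 \sqrt{v} \int_0^{B/2} \sqrt{\log \mathcal{N}_\infty(\mathcal{F}, \varepsilon)}\, d\varepsilon .$$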
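As a quick numerical sanity check of the last step (chaining sum versus entropy integral), the sketch below evaluates both sides for a hypothetical entropy profile $\log \mathcal{N}_\infty(\mathcal{F}, \varepsilon) = 1/\varepsilon$. The profile, the function names and the midpoint quadrature are illustrative assumptions, not part of the paper.

```python
import math

def chaining_sum(B, v, log_cover, K):
    """6 * sum_k B 2^-(k+1) sqrt(v * log N(B / 2^(k+1))), k = 0..K-1."""
    return sum(
        6.0 * B * 2.0 ** -(k + 1) * math.sqrt(v * log_cover(B / 2 ** (k + 1)))
        for k in range(K)
    )

def dudley_integral(B, v, log_cover, steps=20000):
    """12 sqrt(v) * integral over (0, B/2] of sqrt(log N(eps)), midpoint rule."""
    h = (B / 2.0) / steps
    total = sum(math.sqrt(log_cover((i + 0.5) * h)) for i in range(steps))
    return 12.0 * math.sqrt(v) * total * h

# Hypothetical entropy profile, chosen only for illustration.
log_cover = lambda eps: 1.0 / eps
lhs = chaining_sum(1.0, 1.0, log_cover, K=20)
rhs = dudley_integral(1.0, 1.0, log_cover)
```

For this profile the integral equals $12\sqrt{2}$ in closed form, and the chaining sum stays below it, as the monotonicity argument predicts.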
Appendix B. Adaptive Multi-variable Exponentiated Gradient

In this subsection, we provide an adaptive version of Algorithm 1 when the time horizon $T$ is not known in advance. We adopt the notations of Section 2.1. Basically, the fixed tuning parameters $\eta^{(1)}, \dots, \eta^{(K)}$ are replaced with time-varying learning rates $\eta_t^{(1)}, \dots, \eta_t^{(K)}$.


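Algorithm 3: Adaptive Multi-variable Exponentiated Gradient

input: optimization domain $\Delta_{N_1} \times \dots \times \Delta_{N_K}$ (where $N_1, \dots, N_K$ are positive integers).

initialization: set $\hat u_1^{(k)} = \bigl(\frac{1}{N_k}, \dots, \frac{1}{N_k}\bigr) \in \Delta_{N_k}$, for all $k = 1, \dots, K$.

for each round $t = 1, 2, \dots$ do

• Output $\bigl(\hat u_t^{(1)}, \dots, \hat u_t^{(K)}\bigr) \in \Delta_{N_1} \times \dots \times \Delta_{N_K}$ and observe the differentiable and jointly convex loss function $\ell_t : \Delta_{N_1} \times \dots \times \Delta_{N_K} \to \mathbb{R}$.

• Update the tuning parameters $\eta_{t+1}^{(k)}$ for all $k = 1, \dots, K$ as follows:

$$\eta_{t+1}^{(k)} = \frac{1}{G^{(k)}} \sqrt{\frac{\log N_k}{1 + \sum_{s=1}^{t} \mathbb{I}_{\|\nabla_{u^{(k)}} \ell_s\|_\infty > 0}}} .$$

• Compute the new weight vectors $\bigl(\hat u_{t+1}^{(1)}, \dots, \hat u_{t+1}^{(K)}\bigr) \in \Delta_{N_1} \times \dots \times \Delta_{N_K}$ as follows:

$$\hat u_{t+1,i}^{(k)} = \frac{1}{Z_{t+1}^{(k)}} \exp\Bigl(-\eta_{t+1}^{(k)} \sum_{s=1}^{t} \partial_{\hat u_{s,i}^{(k)}} \ell_s\bigl(\hat u_s^{(1)}, \dots, \hat u_s^{(K)}\bigr)\Bigr), \qquad i \in \{1, \dots, N_k\},$$

where $\partial_{\hat u_{s,i}^{(k)}} \ell_s$ denotes the partial derivative of $\ell_s$ with respect to the $i$-th component of the vector variable $\hat u_s^{(k)}$, and where the normalization factor $Z_{t+1}^{(k)}$ is defined by

$$Z_{t+1}^{(k)} = \sum_{i=1}^{N_k} \exp\Bigl(-\eta_{t+1}^{(k)} \sum_{s=1}^{t} \partial_{\hat u_{s,i}^{(k)}} \ell_s\bigl(\hat u_s^{(1)}, \dots, \hat u_s^{(K)}\bigr)\Bigr).$$

end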
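A minimal sketch of Algorithm 3 is given below. The learning-rate formula in the source is damaged; the sketch assumes $\eta_{t+1}^{(k)} = \frac{1}{G^{(k)}} \sqrt{\log N_k / (1 + \#\{s \le t : \text{partial gradient nonzero}\})}$, which is consistent with the regret bound of Theorem 9 but should be checked against the published algorithm. The class name `AdaptiveMultiEG` and its interface are hypothetical.

```python
import math

class AdaptiveMultiEG:
    """Hypothetical sketch of the adaptive multi-variable EG update."""

    def __init__(self, sizes, grad_bounds):
        self.sizes = list(sizes)      # N_1, ..., N_K
        self.G = list(grad_bounds)    # G^(1), ..., G^(K)
        self.cum_grad = [[0.0] * n for n in self.sizes]
        self.active = [0] * len(self.sizes)  # rounds with nonzero gradient
        self.u = [[1.0 / n] * n for n in self.sizes]  # uniform start

    def update(self, grads):
        """grads[k][i]: partial derivative of the current loss with respect
        to the i-th component of u^(k), evaluated at the current weights."""
        for k, g in enumerate(grads):
            if max(abs(x) for x in g) > 0:
                self.active[k] += 1
            for i, x in enumerate(g):
                self.cum_grad[k][i] += x
            # Assumed time-varying learning rate (see lead-in).
            n = self.sizes[k]
            eta = math.sqrt(math.log(n) / (1 + self.active[k])) / self.G[k]
            shift = min(self.cum_grad[k])  # numerical stability only
            w = [math.exp(-eta * (c - shift)) for c in self.cum_grad[k]]
            z = sum(w)  # normalization factor Z
            self.u[k] = [x / z for x in w]
        return self.u
```

Blocks whose partial gradient is identically zero keep their uniform weights and do not advance their local counter, which is what makes the bound depend on $T_k$ rather than on $T$.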


The Adaptive Multi-variable Exponentiated Gradient algorithm satisfies the regret bound of Theorem 9 below.

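Theorem 9 Assume that the loss functions $\ell_t : \Delta_{N_1} \times \dots \times \Delta_{N_K} \to \mathbb{R}$, $t \ge 1$, are differentiable and jointly convex. Assume also the following upper bound on their partial gradients: for all $k \in \{1, \dots, K\}$,

$$\max_{1 \le t \le T} \bigl\|\nabla_{u^{(k)}} \ell_t\bigr\|_\infty \le G^{(k)}. \tag{20}$$

Then, the Adaptive Multi-variable Exponentiated Gradient algorithm (Algorithm 3) has a regret bounded as follows:

$$\sum_{t=1}^{T} \ell_t\bigl(\hat u_t^{(1)}, \dots, \hat u_t^{(K)}\bigr) - \min_{u^{(1)}, \dots, u^{(K)}} \sum_{t=1}^{T} \ell_t\bigl(u^{(1)}, \dots, u^{(K)}\bigr) \le 2 \sum_{k=1}^{K} G^{(k)} \sqrt{T_k \log N_k},$$

where $T_k = \sum_{t=1}^{T} \mathbb{I}_{\|\nabla_{u^{(k)}} \ell_t\|_\infty > 0}$ and where the minimum is taken over all $\bigl(u^{(1)}, \dots, u^{(K)}\bigr) \in \Delta_{N_1} \times \dots \times \Delta_{N_K}$.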

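Proof (of Theorem 9) The proof starts as the one of Theorem 1. From (31), we can see that

$$\sum_{t=1}^{T} \ell_t\bigl(\hat u_t^{(1)}, \dots, \hat u_t^{(K)}\bigr) - \min_{u^{(1)}, \dots, u^{(K)}} \sum_{t=1}^{T} \ell_t\bigl(u^{(1)}, \dots, u^{(K)}\bigr) \le \sum_{k=1}^{K} \Bigl(\sum_{t=1}^{T} g_t^{(k)} \cdot \hat u_t^{(k)} - \min_{1 \le i \le N_k} \sum_{t=1}^{T} g_{t,i}^{(k)}\Bigr)$$
$$= \sum_{k=1}^{K} \Bigl(\sum_{t \in \mathcal{T}^{(k)}} g_t^{(k)} \cdot \hat u_t^{(k)} - \min_{1 \le i \le N_k} \sum_{t \in \mathcal{T}^{(k)}} g_{t,i}^{(k)}\Bigr), \tag{21}$$

where $g_t^{(k)} = \nabla_{u^{(k)}} \ell_t\bigl(\hat u_t^{(1)}, \dots, \hat u_t^{(K)}\bigr)$ and where $\mathcal{T}^{(k)} = \bigl\{t = 1, \dots, T : \|\nabla_{u^{(k)}} \ell_t\|_\infty > 0\bigr\}$.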


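Note that the right-hand side of (21) is the sum of $K$ regrets. Let $k \in \{1, \dots, K\}$. By definition of the Adaptive Multi-variable Exponentiated Gradient algorithm, the sequence of weight vectors $(\hat u_t^{(k)})_{t \in \mathcal{T}^{(k)}}$ corresponds exactly to the weight vectors output by the Exponentially Weighted Average forecaster with time-varying parameter (see Page 50 of Gerchinovitz 2011) applied to $N_k$ experts associated with the loss vectors $g_t^{(k)} \in \mathbb{R}^{N_k}$, $t \in \mathcal{T}^{(k)}$. We can therefore use the well-known corresponding regret bound available, e.g., in Proposition 2.1 of Gerchinovitz (2011). Noting that the loss vectors $g_t^{(k)}$ lie in $[-G^{(k)}, G^{(k)}]^{N_k}$ by Assumption (20), and setting $T_k = \mathrm{card}\, \mathcal{T}^{(k)}$, we thus get that

$$\sum_{t \in \mathcal{T}^{(k)}} g_t^{(k)} \cdot \hat u_t^{(k)} - \min_{1 \le i \le N_k} \sum_{t \in \mathcal{T}^{(k)}} g_{t,i}^{(k)} \le 2\, G^{(k)} \sqrt{T_k \log N_k} .$$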


Note that the additional term $\sqrt{\log N_k}$ in the upper-bound of Gerchinovitz (2011) is actually not needed, since we can assume that $\eta_{T+1}^{(k)} = \eta_T^{(k)}$ because $\eta_{T+1}^{(k)}$ is not used by the algorithm at rounds $t \le T$. Substituting the last upper bound in the right-hand side of (21) concludes the proof. ■
Appendix C. An efficient chaining algorithm for Hölder classes

In this appendix, we extend the analysis of Section 3 to Hölder function classes. In the sequel $\mathcal{F}$ denotes the set of functions on $[0,1]$ whose $q$ first derivatives ($q \in \mathbb{N}$) exist and are all bounded in supremum norm by a constant $B$, and whose $q$th derivative is Hölder continuous of order $\alpha \in (0,1]$ with coefficient $\lambda > 0$. In other words, any function $f \in \mathcal{F}$ satisfies

$$\forall x, y \in [0,1], \quad \bigl|f^{(q)}(x) - f^{(q)}(y)\bigr| \le \lambda |x - y|^{\alpha}, \tag{22}$$

and $\|f^{(k)}\|_\infty \le B$ for all $k \in \{0, \dots, q\}$. We denote by $\beta = q + \alpha$ the coefficient of regularity of $\mathcal{F}$. Recall from the introduction that $\log \mathcal{N}_\infty(\mathcal{F}, \varepsilon) = O(\varepsilon^{-1/\beta})$, so that, by Theorem 2 and (2), if $\beta > 1/2$, the Chaining Exponentially Weighted Average forecaster guarantees a regret of $O\bigl(T^{1/(2\beta+1)}\bigr)$, which is optimal. We explain below how to modify this algorithm with non-proper $\varepsilon$-nets of $(\mathcal{F}, \|\cdot\|_\infty)$ that are easier to manage from a computational viewpoint. This leads to a quasi-optimal regret of $O\bigl(T^{1/(2\beta+1)} (\log T)^{3/2}\bigr)$.

The analysis follows the one of Section 3 which dealt with the special case of 1-Lipschitz functions. The main difference consists in replacing piecewise-constant approximations with piecewise-polynomial approximations.


C.1. Constructing computationally-manageable ε-nets via exponentially nested discretization

Let $\gamma \in (B/T, B)$ be a fixed real number that will play the same role as in Theorem 2. Using the fact that all functions in $\mathcal{F}$ are Hölder, we can approximate $\mathcal{F}$ with piecewise-polynomial functions as follows.

Let $\delta_x > 0$ and $\delta_y > 0$ be two discretization widths that will be fixed later by the analysis. We partition the x-axis $[0,1]$ into $1/\delta_x$ subintervals $I_a = [(a-1)\delta_x, a\delta_x)$, $a = 1, \dots, 1/\delta_x$ (the last interval is closed at $x = 1$). We also use a discretization of length $\delta_y$ on the y-axis $[-B, B]$, by considering the set

$$\mathcal{Y}^{(0)} = \bigl\{-B + j\delta_y : j = 0, \dots, 2B/\delta_y\bigr\}.$$

For the sake of simplicity, we assume that both $1/\delta_x$ and $2B/\delta_y$ are integers. Otherwise, it suffices to consider $\lceil 1/\delta_x \rceil$ and $\lceil 2B/\delta_y \rceil$, which only impacts the constants of the final Theorem 11. We then define the sets of clipped polynomial functions for every $a \in \{1, \dots, 1/\delta_x\}$

$$\mathcal{P}_a^{(0)} = \Bigl\{ x \mapsto \Bigl[a_0 + \frac{a_1}{1!}(x - x_a) + \dots + \frac{a_q}{q!}(x - x_a)^q\Bigr]_B : a_0, \dots, a_q \in \mathcal{Y}^{(0)} \Bigr\}.$$

Here, $[\cdot]_B$ is the clipping operator defined by $[x]_B = \min\{B, \max\{-B, x\}\}$ and $x_a$ is the center of $I_a$. Now, we define the set $\mathcal{F}^{(0)}$ of piecewise-clipped polynomial functions $f^{(0)} : [0,1] \to [-B, B]$ of the form

$$f^{(0)}(x) = \sum_{a=1}^{1/\delta_x} P_a^{(0)}(x)\, \mathbb{I}_{x \in I_a}, \qquad \forall a \in \{1, \dots, 1/\delta_x\} \quad P_a^{(0)} \in \mathcal{P}_a^{(0)}. \tag{23}$$
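The piecewise-clipped polynomials above are straightforward to evaluate. The following sketch is an illustration only; the helper names `clip`, `clipped_poly` and `piecewise` are hypothetical, and the coefficients are passed in directly rather than drawn from the grid of admissible values.

```python
import math

def clip(value, B):
    """Clipping operator [value]_B = min(B, max(-B, value))."""
    return min(B, max(-B, value))

def clipped_poly(coeffs, center, B):
    """x -> [ sum_i coeffs[i] / i! * (x - center)^i ]_B."""
    def P(x):
        s = sum(c / math.factorial(i) * (x - center) ** i
                for i, c in enumerate(coeffs))
        return clip(s, B)
    return P

def piecewise(polys, delta_x):
    """f(x) = P_a(x) on I_a = [(a-1) delta_x, a delta_x); last cell closed."""
    def f(x):
        a = min(int(x / delta_x), len(polys) - 1)  # 0-based cell index
        return polys[a](x)
    return f

# Two cells of width 1/2, each with its own polynomial centered in the cell.
f0 = piecewise([clipped_poly([0.5, 2.0], 0.25, 1.0),
                clipped_poly([-0.5], 0.75, 1.0)], 0.5)
```

Clipping is what keeps every element of the net inside $[-B, B]$ even when the unclipped polynomial overshoots near the ends of a cell.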


Remark that the above definition is similar to (16), where the constants $c_a^{(0)}$ have been substituted with clipped polynomials. Using the fact that all functions in $\mathcal{F}$ are Hölder, we can see (cf. Lemma 10) that for $\delta_x = 2\bigl(q!\,\gamma/(2\lambda)\bigr)^{1/\beta}$ and $\delta_y = \gamma/e$, the set $\mathcal{F}^{(0)}$ is a $\gamma$-net^4 of $(\mathcal{F}, \|\cdot\|_\infty)$.

4. This $\gamma$-net is not proper since $\mathcal{F}^{(0)} \not\subset \mathcal{F}$.
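Refinement via an exponentially nested discretization. Next we construct $\gamma/2^m$-nets that are refinements of the $\gamma$-net $\mathcal{F}^{(0)}$. We need to define an exponentially nested discretization for each subinterval $I_a$ as follows: for any level $m \ge 1$, we partition $I_a$ into $4^m$ subintervals $I_a^{(m,n)}$, $n = 1, \dots, 4^m$, of equal size $\delta_x/4^m$. Note that the subintervals $I_a^{(m,n)}$, $a = 1, \dots, 1/\delta_x$ and $n = 1, \dots, 4^m$, form a partition of $[0,1]$. We call it the level-$m$ partition.

Now, we design the sets of clipped polynomial functions $\mathcal{P}_a^{(m,n)}$ that will refine the approximation of $\mathcal{F}$ on each interval $I_a^{(m,n)}$. To do so, for every $m \ge 1$ we set successive dyadic refining discretizations of the coefficients space $[-B, B]$:

$$\mathcal{Y}^{(m)} = \bigl\{-B + j\delta_y/2^m : j = 0, \dots, 2^{m+1}B/\delta_y\bigr\}, \tag{24}$$

and we define the corresponding sets of clipped polynomial functions for all $a \in \{1, \dots, 1/\delta_x\}$, all $m \in \{1, \dots, M\}$, and $n \in \{1, \dots, 4^m\}$

$$\mathcal{P}_a^{(m,n)} = \Bigl\{ x \mapsto \Bigl[a_0 + \frac{a_1}{1!}\bigl(x - x_a^{(m,n)}\bigr) + \dots + \frac{a_q}{q!}\bigl(x - x_a^{(m,n)}\bigr)^q\Bigr]_B : a_0, \dots, a_q \in \mathcal{Y}^{(m)} \Bigr\}, \tag{25}$$

where $x_a^{(m,n)}$ is the center of the interval $I_a^{(m,n)}$. Then, we define the sets of differences between clipped polynomial functions of two consecutive levels

$$\mathcal{Q}_a^{(m,n)} = \Bigl\{ P^{(m)} - P^{(m-1)} : \bigl\|P^{(m)} - P^{(m-1)}\bigr\|_\infty \le 3\gamma/2^m,\ P^{(m)} \in \mathcal{P}_a^{(m,n)},\ P^{(m-1)} \in \mathcal{P}_a^{(m-1, n_{m-1})} \Bigr\},$$

where $n_{m-1}$ denotes the unique integer $n'$ such that $I_a^{(m,n)} \subset I_a^{(m-1,n')}$. (For $m = 1$, $\mathcal{P}_a^{(m-1, n_{m-1})}$ is replaced with $\mathcal{P}_a^{(0)}$ in the definition of $\mathcal{Q}_a^{(m,n)}$.) The functions in $\mathcal{Q}_a^{(m,n)}$ will play the same role as the constants $c_a^{(m,n)}$ for the Lipschitz case to refine the approximation from the level-$(m-1)$ partition to the level-$m$ partition. Note that each $Q \in \mathcal{Q}_a^{(m,n)}$ takes values in $[-3\gamma/2^m, 3\gamma/2^m]$.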


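Then, we enrich the set $\mathcal{F}^{(0)}$ by looking at all the functions of the form $f^{(0)} + \sum_{m=1}^{M} g^{(m)}$, where $f^{(0)} \in \mathcal{F}^{(0)}$ and where every function $g^{(m)}$ is the difference of a piecewise-clipped polynomial on the level-$m$ partition and a piecewise-clipped polynomial on the previous level $m-1$, coinciding on each $I_a^{(m,n)}$ with an element $Q_a^{(m,n)} \in \mathcal{Q}_a^{(m,n)}$. In other words, we define the level-$M$ approximation set $\mathcal{F}^{(M)}$ as the set of all functions $f_c : [0,1] \to \mathbb{R}$ of the form

$$f_c(x) = \sum_{a=1}^{1/\delta_x} P_a^{(0)}(x)\, \mathbb{I}_{x \in I_a} + \sum_{m=1}^{M} \sum_{a=1}^{1/\delta_x} \sum_{n=1}^{4^m} Q_a^{(m,n)}(x)\, \mathbb{I}_{x \in I_a^{(m,n)}}, \tag{26}$$

where $P_a^{(0)} \in \mathcal{P}_a^{(0)}$ and $Q_a^{(m,n)} \in \mathcal{Q}_a^{(m,n)}$. Once again, see (26) as an extension of (17), where the constants $c_a^{(m,n)}$ have been replaced with the functions $Q_a^{(m,n)}$.

Using again the fact that all functions in $\mathcal{F}$ are Hölder, we can show that the set $\mathcal{F}^{(M)}$ of all functions $f_c$ is a $\gamma/2^M$-net of $(\mathcal{F}, \|\cdot\|_\infty)$; see Lemma 10 below (whose proof is postponed to Appendix D.3) for further details.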
Lemma 10 Let $\mathcal{F}$ be the set of Hölder functions defined in (22). Assume that $\beta = q + \alpha \ge 1/2$. Let $\delta_x = 2\bigl(q!\,\gamma/(2\lambda)\bigr)^{1/\beta}$ and $\delta_y = \gamma/e$. Then:

• the set $\mathcal{F}^{(0)}$ defined in (23) is a $\gamma$-net of $(\mathcal{F}, \|\cdot\|_\infty)$;

• for all $M \ge 1$, the set $\mathcal{F}^{(M)}$ defined in (26) is a $\gamma/2^M$-net of $(\mathcal{F}, \|\cdot\|_\infty)$.
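The first claim of Lemma 10 can be checked numerically on a concrete function. The sketch below is an illustration under stated assumptions: it takes $\delta_x = 2(q!\,\gamma/(2\lambda))^{1/\beta}$ and $\delta_y = \gamma/e$, rounds the number of cells up when the widths do not divide evenly, and uses $f = \sin$ with $q = 1$, $\alpha = 1$, $B = \lambda = 1$ as a test function. The function name `taylor_net_error` is hypothetical.

```python
import math

def taylor_net_error(f_derivs, B, lam, q, alpha, gamma, grid=2000):
    """Worst observed gap between f and its piecewise-clipped polynomial
    approximation with coefficients rounded to the y-grid.
    f_derivs = [f, f', ..., f^(q)] as callables."""
    beta = q + alpha
    delta_x = 2.0 * (math.factorial(q) * gamma / (2.0 * lam)) ** (1.0 / beta)
    delta_y = gamma / math.e
    n_x = math.ceil(1.0 / delta_x)      # number of x-cells, rounded up
    n_y = math.ceil(2.0 * B / delta_y)  # number of y-steps, rounded up
    wx, wy = 1.0 / n_x, 2.0 * B / n_y   # effective widths (not larger)
    worst = 0.0
    for s in range(grid + 1):
        x = s / grid
        a = min(int(x * n_x), n_x - 1)
        xa = (a + 0.5) * wx             # center of the cell containing x
        # Round each derivative at the center to the nearest grid value.
        b = [-B + round((d(xa) + B) / wy) * wy for d in f_derivs]
        p = sum(bi / math.factorial(i) * (x - xa) ** i
                for i, bi in enumerate(b))
        p = min(B, max(-B, p))          # clipping operator
        worst = max(worst, abs(f_derivs[0](x) - p))
    return worst

err = taylor_net_error([math.sin, math.cos], 1.0, 1.0, 1, 1.0, 0.1)
```

The check only samples finitely many points, so it illustrates the lemma rather than proving it.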

C.2. A chaining algorithm using this exponentially nested refining discretization

Next we design an algorithm which, as in Section 3, is able to be competitive against any function $f_c \in \mathcal{F}^{(M)}$ while being computationally tractable. More precisely:

We run $1/\delta_x$ instances of the same algorithm $\mathcal{A}$ in parallel; the $a$-th instance corresponds to the subinterval $I_a$ and it is updated only at rounds $t$ such that $x_t \in I_a$.

Next we focus on the $a$-th instance of the algorithm $\mathcal{A}$, whose local time is only incremented when a new $x_t$ falls into $I_a$. As in Algorithm 2, we use a combination of the EWA and the Multi-variable EG forecasters to perform high-scale and low-scale aggregation simultaneously:

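Low-scale aggregation: we run $\mathrm{card}\, \mathcal{P}_a^{(0)} \le (2B/\delta_y + 1)^{q+1}$ instances $\mathcal{B}_{a,j}$, $j = 1, \dots, \mathrm{card}\, \mathcal{P}_a^{(0)}$, of the Adaptive Multi-variable Exponentiated Gradient algorithm (Algorithm 3 in the appendix) simultaneously. Each instance $\mathcal{B}_{a,j}$ corresponds to a particular polynomial $P_{a,j}^{(0)} \in \mathcal{P}_a^{(0)}$ and is run (similarly to (5)) with the loss function $\ell_t$ defined for all weight vectors $u^{(m,n)} \in \Delta_{\mathrm{card}\, \mathcal{Q}_a^{(m,n)}}$ by

$$\ell_t\bigl(u^{(m,n)},\ m = 1, \dots, M,\ n = 1, \dots, 4^m\bigr) = \Bigl(y_t - P_{a,j}^{(0)}(x_t) - \sum_{m=1}^{M} \sum_{n=1}^{4^m} \sum_{k=1}^{\mathrm{card}\, \mathcal{Q}_a^{(m,n)}} u_k^{(m,n)}\, Q_{a,k}^{(m,n)}(x_t)\, \mathbb{I}_{x_t \in I_a^{(m,n)}}\Bigr)^2 . \tag{27}$$

Here, $Q_{a,1}^{(m,n)}, Q_{a,2}^{(m,n)}, \dots$ denote the elements of $\mathcal{Q}_a^{(m,n)}$ that have been ordered. The above convex combinations ensure that subalgorithm $\mathcal{B}_{a,j}$ is competitive against the best elements in $\mathcal{Q}_a^{(m,n)}$ on subintervals $I_a^{(m,n)}$ for all $m$ and $n$. The weight vectors formed by this subalgorithm $\mathcal{B}_{a,j}$ (when $x_t \in I_a$) are denoted by $\hat u_{t,a,j}^{(m,n)}$ and we set for all $j = 1, \dots, \mathrm{card}\, \mathcal{P}_a^{(0)}$

$$\hat f_{t,a,j}(x) = P_{a,j}^{(0)}(x) + \sum_{m=1}^{M} \sum_{n=1}^{4^m} \sum_{k=1}^{\mathrm{card}\, \mathcal{Q}_a^{(m,n)}} \hat u_{t,a,j,k}^{(m,n)}\, Q_{a,k}^{(m,n)}(x)\, \mathbb{I}_{x \in I_a^{(m,n)}},$$

where $P_{a,j}^{(0)}$ is the $j$th element of $\mathcal{P}_a^{(0)}$.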


High-scale aggregation: we aggregate the forecasters above $\hat f_{t,a,j}$ for $j \in \{1, \dots, \mathrm{card}\, \mathcal{P}_a^{(0)}\}$ with a standard Exponentially Weighted Average forecaster (tuned, e.g., with the parameter $\eta = 1/(2(5B)^2) = 1/(50B^2)$):

$$\hat f_{t,a} = \sum_{j=1}^{\mathrm{card}\, \mathcal{P}_a^{(0)}} \hat w_{t,a,j}\, \hat f_{t,a,j}. \tag{28}$$

Putting all things together: at every time $t \ge 1$, we make the prediction $\hat f_t(x_t) = \sum_{a=1}^{1/\delta_x} \hat f_{t,a}(x_t)\, \mathbb{I}_{x_t \in I_a}$.
We call this algorithm the Nested Chaining Algorithm for Hölder functions.

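Theorem 11 Let $B > 0$, $T \ge 2$, and $\mathcal{F}$ be the set of Hölder functions defined in (22). Assume that $\beta = q + \alpha \ge 1/2$ and that $\max_{1 \le t \le T} |y_t| \le B$. Then, the Nested Chaining Algorithm for Hölder functions defined above and tuned with the parameters $\delta_x = 2\bigl(q!\,\gamma/(2\lambda)\bigr)^{1/\beta}$, $\delta_y = \gamma/e$, $\gamma = B T^{-\beta/(2\beta+1)}$ and $M = \lceil \log_2(\gamma T/B) \rceil$ satisfies, for some constant $c > 0$ depending only on $q$ and $\lambda$,
$$\mathrm{Reg}_T(\mathcal{F}) \le c \max\bigl\{B^{2 - 1/\beta}, B^2\bigr\}\, T^{1/(2\beta+1)} (\log T)^{3/2} .$$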

The proof is postponed to Appendix D.5. The logarithmic factor $(\log T)^{3/2}$ can be reduced to $\log T$, by partitioning $I_a$ into $2^{m/\beta}$ subintervals instead of $4^m$ subintervals. However, the partition at level $m \ge 2$ is then not necessarily nested in the partitions of lower levels, which makes the proof slightly more difficult.

Note that the Nested Chaining Algorithm for Hölder functions is computationally tractable as shown by the following lemma, whose proof is deferred to Appendix D.6.

Lemma 12 Under the assumptions of Theorem 11, the complexity of the Nested Chaining Algorithm for Hölder functions defined above satisfies:

• Storage complexity: $O\bigl(T^{2q+4+\frac{\beta(q-1)+1}{2\beta+1}} \log T\bigr)$;

• Time complexity: $O\bigl(T^{(q+1)\left(2+\frac{\beta}{2\beta+1}\right)} \log T\bigr)$.
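To get a feel for the polynomial degrees in Lemma 12, the exponents of $T$ can be computed exactly with rational arithmetic. The sketch assumes the exponents $2q + 4 + (\beta(q-1)+1)/(2\beta+1)$ for storage and $(q+1)(2 + \beta/(2\beta+1))$ for time, as derived in Appendix D.6; the function names are hypothetical.

```python
from fractions import Fraction

def storage_exponent(q, beta):
    """Exponent of T in the storage bound (log factor omitted)."""
    beta = Fraction(beta)
    return 2 * q + 4 + (beta * (q - 1) + 1) / (2 * beta + 1)

def time_exponent(q, beta):
    """Exponent of T in the per-round time bound (log factor omitted)."""
    beta = Fraction(beta)
    return (q + 1) * (2 + beta / (2 * beta + 1))

# Lipschitz-like case q = 0, alpha = 1, hence beta = 1.
lipschitz_storage = storage_exponent(0, 1)
lipschitz_time = time_exponent(0, 1)
```

For $q = 0$ and $\beta = 1$ this gives exponents $4$ and $7/3$, noticeably larger than the $T^{4/3}$ and $T^{1/3}$ obtained for the dedicated Dyadic Chaining Algorithm, which reflects the cost of enumerating polynomial differences.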
Appendix D. Omitted proofs

In this appendix, we provide the proofs which were omitted in the main body of the paper. 


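D.1. Proof of Theorem 1

As is the case for the classical Exponentiated Gradient algorithm, the proof relies on a linearization argument. Let $\bigl(u^{(1)}, \dots, u^{(K)}\bigr) \in \Delta_{N_1} \times \dots \times \Delta_{N_K}$. By differentiability and joint convexity of $\ell_t$ for all $t = 1, \dots, T$, we have that

$$\sum_{t=1}^{T} \ell_t\bigl(\hat u_t^{(1)}, \dots, \hat u_t^{(K)}\bigr) - \sum_{t=1}^{T} \ell_t\bigl(u^{(1)}, \dots, u^{(K)}\bigr) \le \sum_{t=1}^{T} \nabla \ell_t\bigl(\hat u_t^{(1)}, \dots, \hat u_t^{(K)}\bigr) \cdot \bigl(\hat u_t^{(1)} - u^{(1)}, \dots, \hat u_t^{(K)} - u^{(K)}\bigr) \tag{29}$$
$$= \sum_{t=1}^{T} \sum_{k=1}^{K} \nabla_{u^{(k)}} \ell_t\bigl(\hat u_t^{(1)}, \dots, \hat u_t^{(K)}\bigr) \cdot \bigl(\hat u_t^{(k)} - u^{(k)}\bigr), \tag{30}$$

where $\nabla \ell_t$ in (29) denotes the usual (joint) gradient of $\ell_t$ (with $\sum_{k=1}^{K} N_k$ components), and where (30) follows from splitting the gradient into $K$ partial gradients: $\nabla \ell_t = \bigl(\nabla_{u^{(1)}} \ell_t, \dots, \nabla_{u^{(K)}} \ell_t\bigr)$.

As a consequence, setting $g_t^{(k)} = \nabla_{u^{(k)}} \ell_t\bigl(\hat u_t^{(1)}, \dots, \hat u_t^{(K)}\bigr) \in \mathbb{R}^{N_k}$ and taking the maximum of the last inequality over all $\bigl(u^{(1)}, \dots, u^{(K)}\bigr) \in \Delta_{N_1} \times \dots \times \Delta_{N_K}$, we can see that

$$\sum_{t=1}^{T} \ell_t\bigl(\hat u_t^{(1)}, \dots, \hat u_t^{(K)}\bigr) - \min_{u^{(1)}, \dots, u^{(K)}} \sum_{t=1}^{T} \ell_t\bigl(u^{(1)}, \dots, u^{(K)}\bigr) \le \sum_{k=1}^{K} \max_{u^{(k)} \in \Delta_{N_k}} \sum_{t=1}^{T} g_t^{(k)} \cdot \bigl(\hat u_t^{(k)} - u^{(k)}\bigr)$$
$$= \sum_{k=1}^{K} \Bigl(\sum_{t=1}^{T} g_t^{(k)} \cdot \hat u_t^{(k)} - \min_{1 \le i \le N_k} \sum_{t=1}^{T} g_{t,i}^{(k)}\Bigr), \tag{31}$$

where the last equality follows from the fact that the function $u^{(k)} \mapsto \sum_{t=1}^{T} g_t^{(k)} \cdot u^{(k)}$ is linear over the polytope $\Delta_{N_k}$, so that its minimum is achieved on at least one of the vertices of $\Delta_{N_k}$.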


Note that the right-hand side of (31) is the sum of $K$ regrets. Let $k \in \{1, \dots, K\}$. By definition of the Multi-variable Exponentiated Gradient algorithm, the sequence of weight vectors $(\hat u_t^{(k)})_{t \ge 1}$ corresponds exactly to the weight vectors output by the Exponentially Weighted Average forecaster (see Page 14 of Cesa-Bianchi and Lugosi 2006) applied to $N_k$ experts associated with the loss vectors $g_t^{(k)} \in \mathbb{R}^{N_k}$, $t \ge 1$. We can therefore use the well-known corresponding regret bound available, e.g., in Theorem 2.2 of Cesa-Bianchi and Lugosi (2006) or in Theorem 2.1 of Gerchinovitz (2011). Noting that the loss vectors $g_t^{(k)}$ lie in $[-G^{(k)}, G^{(k)}]^{N_k}$ by Assumption (3), we thus get that

$$\sum_{t=1}^{T} g_t^{(k)} \cdot \hat u_t^{(k)} - \min_{1 \le i \le N_k} \sum_{t=1}^{T} g_{t,i}^{(k)} \le G^{(k)} \sqrt{2 T \log N_k} .$$

Substituting the last upper bound in the right-hand side of (31) concludes the proof.

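D.2. An efficient γ-net for Lipschitz classes

Lemma 13 Let $\mathcal{F}$ be the set of functions from $[0,1]$ to $[-B, B]$ that are 1-Lipschitz. Then:

• the set $\mathcal{F}^{(0)}$ defined in (16) is a $\gamma$-net of $(\mathcal{F}, \|\cdot\|_\infty)$;

• for all $M \ge 1$, the set $\mathcal{F}^{(M)}$ defined in (17) is a $\gamma/2^{M+1}$-net of $(\mathcal{F}, \|\cdot\|_\infty)$.

Proof (of Lemma 13)

First claim: $\mathcal{F}^{(0)}$ is a $\gamma$-net of $(\mathcal{F}, \|\cdot\|_\infty)$.

Let $f \in \mathcal{F}$. We explain why there exist $c_1^{(0)}, \dots, c_{1/\gamma}^{(0)} \in \mathcal{C}^{(0)}$ such that

$$f^{(0)}(x) = \sum_{a=1}^{1/\gamma} c_a^{(0)}\, \mathbb{I}_{x \in I_a}$$

satisfies $|f(x) - f^{(0)}(x)| \le \gamma$ for all $x \in [0,1]$. We can choose $c_a^{(0)} \in \mathrm{argmin}_{c \in \mathcal{C}^{(0)}} |f(x_a) - c|$, where $x_a$ is the center of the subinterval $I_a$. Indeed, since we can approximate $f(x_a)$ with precision $\gamma/2$ (the y-axis discretization is of width $\gamma$), and since $f$ is 1-Lipschitz on $I_a$, we have that, for all $a \in \{1, \dots, 1/\gamma\}$ and all $x \in I_a$,

$$\bigl|f(x) - c_a^{(0)}\bigr| \le |f(x) - f(x_a)| + \bigl|f(x_a) - c_a^{(0)}\bigr| \le \frac{\gamma}{2} + \frac{\gamma}{2} = \gamma .$$

Since the subintervals $I_a$, $1 \le a \le 1/\gamma$, form a partition of $[0,1]$, we just showed that $\|f - f^{(0)}\|_\infty \le \gamma$.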
Second claim: $\mathcal{F}^{(M)}$ is a $\gamma/2^{M+1}$-net of $(\mathcal{F}, \|\cdot\|_\infty)$.

Let $f \in \mathcal{F}$. We explain why there exist constants $c_a^{(0)} \in \mathcal{C}^{(0)}$ and $c_a^{(m,n)} \in [-\gamma/2^{m-1}, \gamma/2^{m-1}]$ such that

$$f_c(x) = \sum_{a=1}^{1/\gamma} c_a^{(0)}\, \mathbb{I}_{x \in I_a} + \sum_{m=1}^{M} \sum_{a=1}^{1/\gamma} \sum_{n=1}^{2^m} c_a^{(m,n)}\, \mathbb{I}_{x \in I_a^{(m,n)}}$$

satisfies $|f(x) - f_c(x)| \le \gamma/2^{M+1}$ for all $x \in [0,1]$. We argue below that it suffices to:

• choose the constants $c_a^{(0)} \in \mathrm{argmin}_{c \in \mathcal{C}^{(0)}} |f(x_a) - c|$ exactly as for $f^{(0)}$ above;

• choose the constants $c_a^{(m,n)}$ in such a way that, for all levels $m \in \{1, \dots, M\}$, and for all positions $a \in \{1, \dots, 1/\gamma\}$ and $n \in \{1, \dots, 2^m\}$,

$$f\bigl(x_a^{(m,n)}\bigr) = c_a^{(0)} + \sum_{m'=1}^{m} c_a^{(m', n_{m'})}, \tag{32}$$

where $x_a^{(m,n)}$ denotes the center of the subinterval $I_a^{(m,n)}$ and where $n_{m'}$ is the unique integer $n'$ such that $I_a^{(m,n)} \subset I_a^{(m',n')}$. Such a choice can be done in a recursive way (induction on $m$). It is feasible since the functions in $\mathcal{F}$ are 1-Lipschitz (see Figure 1 for an illustration).

To conclude, it is now sufficient to use (32) with $m = M$. Note indeed from (17) that, on each level-$M$ subinterval $I_a^{(M,n)}$, the function $f_c$ is equal to

$$f_c(x) = c_a^{(0)} + \sum_{m=1}^{M} c_a^{(m, n_m)},$$

where $n_m$ is the unique integer $n'$ such that $I_a^{(M,n)} \subset I_a^{(m,n')}$. Thus, by (32), we can see that $f_c\bigl(x_a^{(M,n)}\bigr) = f\bigl(x_a^{(M,n)}\bigr)$ for all points $x_a^{(M,n)}$, $a \in \{1, \dots, 1/\gamma\}$ and $n \in \{1, \dots, 2^M\}$.

Now, if $x \in I_a^{(M,n)}$ is any point in $I_a^{(M,n)}$, then it is at most at a distance of $\gamma/2^{M+1}$ of the middle point $x_a^{(M,n)}$. Therefore, by 1-Lipschitzity of $f$, we have $\bigl|f(x) - f\bigl(x_a^{(M,n)}\bigr)\bigr| \le \gamma/2^{M+1}$. Using the equality $f_c\bigl(x_a^{(M,n)}\bigr) = f\bigl(x_a^{(M,n)}\bigr)$ proved above and the fact that $f_c$ is constant on $I_a^{(M,n)}$, we get that

$$\forall a \in \{1, \dots, 1/\gamma\},\ \forall n \in \{1, \dots, 2^M\},\ \forall x \in I_a^{(M,n)}, \quad |f(x) - f_c(x)| \le \gamma/2^{M+1} .$$

Since the level-$M$ subintervals $I_a^{(M,n)}$, $a \in \{1, \dots, 1/\gamma\}$ and $n \in \{1, \dots, 2^M\}$, form a partition of $[0,1]$, we just showed that $\|f - f_c\|_\infty \le \gamma/2^{M+1}$, which concludes the proof. ■
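The recursive choice of constants in the proof can be replayed numerically. The sketch below is illustrative and assumes that $1/\gamma$ and $2B/\gamma$ are integers: it rounds $f$ at the center of $I_a$ to the $y$-grid, then picks each increment so that the running sum matches $f$ at the center of the current dyadic cell, and returns the resulting value together with the increments. The function name `dyadic_approximation` is hypothetical.

```python
def dyadic_approximation(f, B, gamma, M, x):
    """Return (f_c(x), [c^(1), ..., c^(M)]) along the dyadic path of x."""
    n_cells = round(1.0 / gamma)
    a = min(int(x / gamma), n_cells - 1)  # 0-based index of I_a
    left, width = a * gamma, gamma
    # Level 0: nearest point of the grid {-B + j * gamma}.
    c0 = -B + round((f(left + width / 2.0) + B) / gamma) * gamma
    value, increments = c0, []
    for m in range(1, M + 1):
        width /= 2.0
        if x >= left + width:
            left += width                 # x lies in the right half
        # Match f exactly at the center of the level-m cell.
        c = f(left + width / 2.0) - value
        increments.append(c)
        value += c
    return value, increments

f = lambda x: abs(x - 0.3)  # a 1-Lipschitz test function with values in [-1, 1]
```

Sweeping $x$ over a grid and checking both the size of each increment and the final error gives a direct illustration of the two properties used in the proof.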


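D.3. An efficient γ-net for Hölder classes (proof of Lemma 10)

First claim: $\mathcal{F}^{(0)}$ is a $\gamma$-net of $(\mathcal{F}, \|\cdot\|_\infty)$.

Let $f \in \mathcal{F}$. We explain why there exist $P_a^{(0)} \in \mathcal{P}_a^{(0)}$ for all $a \in \{1, \dots, 1/\delta_x\}$ such that

$$f^{(0)}(x) = \sum_{a=1}^{1/\delta_x} P_a^{(0)}(x)\, \mathbb{I}_{x \in I_a}$$

satisfies $|f(x) - f^{(0)}(x)| \le \gamma$ for all $x \in [0,1]$. Fix $a \in \{1, \dots, 1/\delta_x\}$ and let $x_a$ be the center of the subinterval $I_a$. By Taylor's formula, for all $x \in I_a$ there exists $\xi \in I_a$ such that

$$f(x) = f(x_a) + f'(x_a)(x - x_a) + \frac{f''(x_a)}{2!}(x - x_a)^2 + \dots + \frac{f^{(q)}(x_a)}{q!}(x - x_a)^q + \frac{1}{q!}\bigl(f^{(q)}(\xi) - f^{(q)}(x_a)\bigr)(x - x_a)^q . \tag{33}$$

Thus, the function $f$ can be written as the sum of a polynomial and a term (the last one) that will be proven to be small by the Hölder property (22). Now, for every derivative $i \in \{0, \dots, q\}$ we can choose $b_i \in \mathcal{Y}^{(0)}$ such that

$$\bigl|f^{(i)}(x_a) - b_i\bigr| \le \delta_y/2 . \tag{34}$$

Indeed, the y-axis discretization of $[-B, B]$ is of width $\delta_y$ and $|f^{(i)}(x_a)| \le B$ by definition of $\mathcal{F}$. Thus, setting

$$P_a(x) = b_0 + \frac{b_1}{1!}(x - x_a) + \frac{b_2}{2!}(x - x_a)^2 + \dots + \frac{b_q}{q!}(x - x_a)^q,$$

the polynomial $P_a$ satisfies by (33) for all $x \in I_a$

$$|f(x) - P_a(x)| \le \sum_{i=0}^{q} \frac{\bigl|f^{(i)}(x_a) - b_i\bigr|}{i!} |x - x_a|^i + \frac{\bigl|f^{(q)}(\xi) - f^{(q)}(x_a)\bigr|}{q!} |x - x_a|^q \le \sum_{i=0}^{q} \frac{\delta_y}{2\, i!} |x - x_a|^i + \frac{\lambda}{q!} |\xi - x_a|^{\alpha} |x - x_a|^q,$$

where the second inequality is by (34) and because $f^{(q)}$ is $\alpha$-Hölder with coefficient $\lambda$. Now, since $|\xi - x_a|$ and $|x - x_a|$ are bounded by $\delta_x/2$, it yields

$$|f(x) - P_a(x)| \le \frac{\delta_y}{2} \sum_{i=0}^{q} \frac{(\delta_x/2)^i}{i!} + \frac{\lambda}{q!} \Bigl(\frac{\delta_x}{2}\Bigr)^{\beta} .$$

The choices $\delta_x = 2\bigl(q!\,\gamma/(2\lambda)\bigr)^{1/\beta}$ and $\delta_y = \gamma/e$ finally entail, with $P_a^{(0)} = [P_a]_B$,

$$\bigl|f(x) - P_a^{(0)}(x)\bigr| \le |f(x) - P_a(x)| \le \frac{\gamma}{2} + \frac{\gamma}{2} = \gamma .$$

This concludes the first part of the proof.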


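Second claim: $\mathcal{F}^{(M)}$ is a $\gamma/2^M$-net of $(\mathcal{F}, \|\cdot\|_\infty)$.

Let $f \in \mathcal{F}$. We explain why there exist clipped polynomials $P_a^{(0)} \in \mathcal{P}_a^{(0)}$ and $Q_a^{(m,n)} \in \mathcal{Q}_a^{(m,n)}$ such that

$$f_c(x) = \sum_{a=1}^{1/\delta_x} P_a^{(0)}(x)\, \mathbb{I}_{x \in I_a} + \sum_{m=1}^{M} \sum_{a=1}^{1/\delta_x} \sum_{n=1}^{4^m} Q_a^{(m,n)}(x)\, \mathbb{I}_{x \in I_a^{(m,n)}}$$

satisfies $|f(x) - f_c(x)| \le \gamma/2^M$ for all $x \in [0,1]$. To do so, we show first that there exist clipped polynomials $P_a^{(0)} \in \mathcal{P}_a^{(0)}$ and $p_a^{(m,n)} \in \mathcal{P}_a^{(m,n)}$ such that

$$\tilde f_c(x) = \sum_{a=1}^{1/\delta_x} P_a^{(0)}(x)\, \mathbb{I}_{x \in I_a} + \sum_{n=1}^{4} \sum_{a=1}^{1/\delta_x} \bigl(p_a^{(1,n)} - P_a^{(0)}\bigr)(x)\, \mathbb{I}_{x \in I_a^{(1,n)}} + \sum_{m=2}^{M} \sum_{a=1}^{1/\delta_x} \sum_{n=1}^{4^m} \bigl(p_a^{(m,n)} - p_a^{(m-1, n_{m-1})}\bigr)(x)\, \mathbb{I}_{x \in I_a^{(m,n)}}$$

satisfies $|f(x) - \tilde f_c(x)| \le \gamma/2^M$ for all $x \in [0,1]$. We recall that $n_{m-1}$ denotes the unique integer $n'$ such that $I_a^{(m,n)} \subset I_a^{(m-1,n')}$. First we remark that the function $\tilde f_c$ defined above equals $p_a^{(M,n)}$ on each level-$M$ subinterval $I_a^{(M,n)}$.

Thus, it suffices to design clipped polynomials $p_a^{(M,n)} \in \mathcal{P}_a^{(M,n)}$ such that $\bigl|f(x) - p_a^{(M,n)}(x)\bigr| \le \gamma/2^M$ for all $x \in I_a^{(M,n)}$. To do so, we reproduce the same proof as for $\mathcal{F}^{(0)}$ above. Because $\mathrm{diam}\, I_a^{(m,n)} = \delta_x/4^m \le \delta_x/2^{m/\beta}$ (recall that $\beta \ge 1/2$), for every position $a \in \{1, \dots, 1/\delta_x\}$, every level $m \in \{1, \dots, M\}$, and every $n \in \{1, \dots, 4^m\}$, we can define as for $P_a$ above a polynomial

$$P_a^{(m,n)}(x) = b_0 + \frac{b_1}{1!}\bigl(x - x_a^{(m,n)}\bigr) + \dots + \frac{b_q}{q!}\bigl(x - x_a^{(m,n)}\bigr)^q$$

(recall that $x_a^{(m,n)}$ is the center of $I_a^{(m,n)}$) such that all coefficients $b_j$ have the form $-B + z_j \delta_y 2^{-m}$ for some $z_j \in \{0, \dots, 2^{m+1}B/\delta_y\}$ and

$$\bigl|f(x) - \bigl[P_a^{(m,n)}(x)\bigr]_B\bigr| \le \gamma/2^m \tag{35}$$

for all $x \in I_a^{(m,n)}$. To conclude, we choose the clipped polynomials $p_a^{(m,n)} = \bigl[P_a^{(m,n)}\bigr]_B$.

To conclude the proof, we see that for all $x \in I_a^{(m,n)}$ by the triangle inequality

$$\bigl|p_a^{(m,n)}(x) - p_a^{(m-1, n_{m-1})}(x)\bigr| \le \bigl|p_a^{(m,n)}(x) - f(x)\bigr| + \bigl|f(x) - p_a^{(m-1, n_{m-1})}(x)\bigr| \le \frac{\gamma}{2^m} + \frac{\gamma}{2^{m-1}} = \frac{3\gamma}{2^m},$$

so that $\tilde f_c = f_c$ for the choices $Q_a^{(m,n)}(x) = p_a^{(m,n)}(x) - p_a^{(m-1, n_{m-1})}(x) \in \mathcal{Q}_a^{(m,n)}$.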


D.4. Proof of Theorem 6

We split our proof into two main parts. First, we explain why each function $\hat f_{t,a}$ incurs small cumulative regret inside each subinterval $I_a$. Second, we sum the previous regret bounds over all positions $a = 1, \dots, 1/\gamma$.

Part 1: focus on a subinterval $I_a$. In this part, we fix some $a \in \{1, \dots, 1/\gamma\}$ and we consider the $a$-th instance of the algorithm $\mathcal{A}$, whose local time is only incremented when a new $x_t$ falls into $I_a$. As in Algorithm 2, our instance of algorithm $\mathcal{A}$ uses a combination of the EWA and the Multi-variable EG forecasters to perform high-scale and low-scale aggregation simultaneously. Thus, the proof closely follows the path of the one of Theorem 2. We split again the proof into two subparts: one for each level of aggregation.

Subpart 1: low-scale aggregation.

In this subpart, we fix $j \in \{0, \dots, 2B/\gamma\}$. The proof starts as the one of Theorem 2 except that $\mathcal{A}$ applies the adaptive version of the Multi-variable Exponentiated Gradient forecaster (Algorithm 3, Appendix B) with the loss function $\ell_t$ defined in (18). We will thus apply Theorem 9 (available in Appendix B) instead of Theorem 1. After checking its assumptions exactly as in the proof of Theorem 2, we can apply Theorem 9. The norms of the loss gradients $\|\nabla_{u^{(m,n)}} \ell_t\|_\infty$ are bounded by $16B\gamma/2^m$ if $x_t$ falls in $I_a^{(m,n)}$ and by $0$ otherwise. Setting $T_a^{(m,n)} = \sum_{t=1}^{T} \mathbb{I}_{x_t \in I_a^{(m,n)}}$, Theorem 9 yields as in (11):


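$$\sum_{t=1}^{T} \bigl(y_t - \hat f_{t,a,j}(x_t)\bigr)^2\, \mathbb{I}_{x_t \in I_a} \tag{36}$$
$$\le \inf_{c^{(m,n)}} \sum_{t=1}^{T} \Bigl(y_t - \Bigl(-B + j\gamma + \sum_{m=1}^{M} \sum_{n=1}^{2^m} c^{(m,n)}\, \mathbb{I}_{x_t \in I_a^{(m,n)}}\Bigr)\Bigr)^2\, \mathbb{I}_{x_t \in I_a} + 2 \sum_{m=1}^{M} \sum_{n=1}^{2^m} \frac{16 B \gamma}{2^m} \sqrt{T_a^{(m,n)} \log 2}, \tag{37}$$

where the infimum is over all constants $c^{(m,n)} \in [-\gamma/2^{m-1}, \gamma/2^{m-1}]$ for every $m = 1, \dots, M$ and $n = 1, \dots, 2^m$. But, for each level $m = 1, \dots, M$, the point $x_t$ only falls into one interval $I_a^{(m,n)}$. Thus, $\sum_{n=1}^{2^m} T_a^{(m,n)} = T_a$, where $T_a = \sum_{t=1}^{T} \mathbb{I}_{x_t \in I_a}$ is the final local time of the $a$-th instance of $\mathcal{A}$. Therefore, using the concavity of the square root and applying Jensen's inequality, (37) entails

$$\sum_{t=1}^{T} \bigl(y_t - \hat f_{t,a,j}(x_t)\bigr)^2\, \mathbb{I}_{x_t \in I_a} \le \inf_{c^{(m,n)}} \sum_{t=1}^{T} \Bigl(y_t - \Bigl(-B + j\gamma + \sum_{m=1}^{M} \sum_{n=1}^{2^m} c^{(m,n)}\, \mathbb{I}_{x_t \in I_a^{(m,n)}}\Bigr)\Bigr)^2\, \mathbb{I}_{x_t \in I_a} + 32 B \gamma \sum_{m=1}^{M} 2^{-m} \sqrt{T_a 2^m \log 2}$$
$$\le \inf_{c^{(m,n)}} \sum_{t=1}^{T} \Bigl(y_t - \Bigl(-B + j\gamma + \sum_{m=1}^{M} \sum_{n=1}^{2^m} c^{(m,n)}\, \mathbb{I}_{x_t \in I_a^{(m,n)}}\Bigr)\Bigr)^2\, \mathbb{I}_{x_t \in I_a} + 32 B \gamma (1 + \sqrt{2}) \sqrt{T_a \log 2} . \tag{38}$$

The second inequality is because $\sum_{m=1}^{\infty} 2^{-m/2} = 1 + \sqrt{2}$.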
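For completeness, the two elementary facts behind the passage from (37) to (38) can be written out as follows (a worked restatement, using the notation $T_a^{(m,n)}$ for the number of rounds at which $x_t$ falls in the level-$m$ cell indexed by $n$):

```latex
% Jensen's inequality for the concave square root, over the 2^m cells of level m:
\sum_{n=1}^{2^m} \sqrt{T_a^{(m,n)}}
  \;\le\; 2^m \sqrt{\frac{1}{2^m} \sum_{n=1}^{2^m} T_a^{(m,n)}}
  \;=\; \sqrt{2^m\, T_a}.
% Geometric series for the sum over levels:
\sum_{m=1}^{M} 2^{-m} \sqrt{2^m}
  \;\le\; \sum_{m=1}^{\infty} 2^{-m/2}
  \;=\; \frac{2^{-1/2}}{1 - 2^{-1/2}}
  \;=\; \frac{1}{\sqrt{2} - 1}
  \;=\; 1 + \sqrt{2}.
```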


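Subpart 2: high-scale aggregation.

Following the proof of Theorem 2, we apply EWA to the experts $\hat f_{t,a,j}$ for $j \in \{0, \dots, 2B/\gamma\}$ with tuning parameter $\eta = 1/(2(4B)^2)$ because $\hat f_{t,a,j} \in [-B - 2\gamma, B + 2\gamma] \subset [-3B, 3B]$ and $y_t \in [-B, B]$. We get from Proposition 3.1 and Page 46 of Cesa-Bianchi and Lugosi (2006) that

$$\sum_{t=1}^{T} \bigl(y_t - \hat f_{t,a}(x_t)\bigr)^2\, \mathbb{I}_{x_t \in I_a} \le \min_{j} \sum_{t=1}^{T} \bigl(y_t - \hat f_{t,a,j}(x_t)\bigr)^2\, \mathbb{I}_{x_t \in I_a} + \frac{\log(2B/\gamma + 1)}{\eta}$$
$$\le \min_{j} \inf_{c^{(m,n)}} \sum_{t=1}^{T} \Bigl(y_t - \Bigl(-B + j\gamma + \sum_{m=1}^{M} \sum_{n=1}^{2^m} c^{(m,n)}\, \mathbb{I}_{x_t \in I_a^{(m,n)}}\Bigr)\Bigr)^2\, \mathbb{I}_{x_t \in I_a} + 32 B (1 + \sqrt{2}) \gamma \sqrt{T_a \log 2} + 32 B^2 \log(2B/\gamma + 1), \tag{39}$$

where the infima are over all $j \in \{0, \dots, 2B/\gamma\}$ and all constants $c^{(m,n)} \in [-\gamma/2^{m-1}, \gamma/2^{m-1}]$, and where the second inequality follows from (38) and from $\eta = 1/(32B^2)$.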









Gaillard Gerchinovitz 


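Part 2: we sum the regrets over all subintervals $I_a$. By definition of $\hat f_t$, we have

$$\sum_{t=1}^{T} \bigl(y_t - \hat f_t(x_t)\bigr)^2 = \sum_{t=1}^{T} \Bigl(y_t - \sum_{a=1}^{1/\gamma} \hat f_{t,a}(x_t)\, \mathbb{I}_{x_t \in I_a}\Bigr)^2 = \sum_{a=1}^{1/\gamma} \sum_{t=1}^{T} \bigl(y_t - \hat f_{t,a}(x_t)\bigr)^2\, \mathbb{I}_{x_t \in I_a} .$$

Now, by definition of $\mathcal{F}^{(M)}$, summing (39) over all $a = 1, \dots, 1/\gamma$ leads to

$$\sum_{t=1}^{T} \bigl(y_t - \hat f_t(x_t)\bigr)^2 \le \inf_{f \in \mathcal{F}^{(M)}} \sum_{t=1}^{T} \bigl(y_t - f(x_t)\bigr)^2 + \frac{32 B^2}{\gamma} \log(2B/\gamma + 1) + 32 B (1 + \sqrt{2}) \gamma \sqrt{\log 2} \Bigl(\sum_{a=1}^{1/\gamma} \sqrt{T_a}\Bigr) . \tag{40}$$

Then, using that $\sum_{a=1}^{1/\gamma} T_a = T$, since at every round $t$, the point $x_t$ only falls into one subinterval $I_a$, and applying Jensen's inequality to the square root, we can see that

$$\sum_{a=1}^{1/\gamma} \sqrt{T_a} \le \sqrt{T/\gamma} .$$

Therefore, substituting in (40), we obtain

$$\sum_{t=1}^{T} \bigl(y_t - \hat f_t(x_t)\bigr)^2 \le \inf_{f \in \mathcal{F}^{(M)}} \sum_{t=1}^{T} \bigl(y_t - f(x_t)\bigr)^2 + \frac{32 B^2}{\gamma} \log(2B/\gamma + 1) + 32 B (1 + \sqrt{2}) \sqrt{\gamma T \log 2} . \tag{41}$$

But, $\mathcal{F}^{(M)}$ is by Lemma 13 a $\gamma/2^{M+1}$-net of $\mathcal{F}$. Using that $M = \lceil \log_2(\gamma T/B) \rceil$ and following the proof of (15), it entails

$$\inf_{f \in \mathcal{F}^{(M)}} \sum_{t=1}^{T} \bigl(y_t - f(x_t)\bigr)^2 \le \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \bigl(y_t - f(x_t)\bigr)^2 + 2B^2 + \frac{B^2}{4T} .$$

Finally, from (41) we have

$$\sum_{t=1}^{T} \bigl(y_t - \hat f_t(x_t)\bigr)^2 \le \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \bigl(y_t - f(x_t)\bigr)^2 + \frac{32 B^2}{\gamma} \log(2B/\gamma + 1) + 32 B (1 + \sqrt{2}) \sqrt{\gamma T \log 2} + 2B^2 + \frac{B^2}{4T} . \tag{42}$$

The above regret bound grows roughly as (we omit logarithmic factors and small additive terms):

$$\gamma^{-1} + \sqrt{\gamma T} .$$

Optimizing in $\gamma$ would yield $\gamma \approx T^{-1/3}$ and a regret roughly of the order of $T^{1/3}$. More rigorously, taking $\gamma = B T^{-1/3}$ and substituting it in (42) concludes the proof.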
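The effect of the tuning $\gamma = B T^{-1/3}$ can be illustrated by evaluating the additive terms of the final bound numerically. The sketch assumes those terms are $\frac{32B^2}{\gamma}\log(2B/\gamma + 1) + 32B(1+\sqrt{2})\sqrt{\gamma T \log 2} + 2B^2 + B^2/(4T)$, as read from the derivation above; the function names are hypothetical.

```python
import math

def lipschitz_regret_bound(T, B, gamma):
    """Additive terms of the regret bound for the Lipschitz case
    (assumed form, see lead-in)."""
    high = 32.0 * B * B / gamma * math.log(2.0 * B / gamma + 1.0)
    low = 32.0 * B * (1.0 + math.sqrt(2.0)) * math.sqrt(gamma * T * math.log(2.0))
    approx = 2.0 * B * B + B * B / (4.0 * T)
    return high + low + approx

def tuned_gamma(T, B):
    """The tuning gamma = B * T^(-1/3) suggested by the analysis."""
    return B * T ** (-1.0 / 3.0)

# Ratio of the bound to T^(1/3) log T for a few horizons, with B = 1.
ratios = [
    lipschitz_regret_bound(T, 1.0, tuned_gamma(T, 1.0))
    / (T ** (1.0 / 3.0) * math.log(T))
    for T in (10 ** 3, 10 ** 6, 10 ** 9)
]
```

The ratios stay bounded as $T$ grows, which is the numerical counterpart of the $T^{1/3} \log T$ rate.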
D.5. Proof of Theorem 11

The proof closely follows the one of Theorem 6. It is split into two main parts. First, we explain why each function $\hat f_{t,a}$ incurs a small cumulative regret inside each subinterval $I_a$. Second, we sum the previous regret bounds over all positions $a = 1, \dots, 1/\delta_x$.

Part 1: focus on a subinterval $I_a$. In this part, we fix some $a \in \{1, \dots, 1/\delta_x\}$ and we consider the $a$-th instance of the algorithm $\mathcal{A}$, denoted $\mathcal{A}_a$, whose local time is only incremented when a new $x_t$ falls into $I_a$. As in Algorithm 2, $\mathcal{A}_a$ uses a combination of the EWA and the Multi-variable EG forecasters to perform high-scale and low-scale aggregation simultaneously. We split again the proof into two subparts: one for each level of aggregation.

Subpart 1: low-scale aggregation.

In this subpart, we fix $j \in \{1, \dots, \mathrm{card}\, \mathcal{P}_a^{(0)}\}$. Similarly to the proof of Theorem 6, we start by applying Theorem 9. Since the elements in $\mathcal{Q}_a^{(m,n)}$ are bounded in supremum norm by $3\gamma/2^m$, and since the elements in $\mathcal{P}_a^{(0)}$ are bounded by $B$, the norms of the gradients of the loss function (defined in (27)) are bounded by $0$ if $x_t \notin I_a^{(m,n)}$ and as follows otherwise:

$$\bigl\|\nabla_{u^{(m,n)}} \ell_t\bigr\|_\infty \le 2 \bigl(|y_t| + \|\hat f_{t,a,j}\|_\infty\bigr) \max_{k} \bigl\|Q_{a,k}^{(m,n)}\bigr\|_\infty \le 2 (B + 4B)\, 3\gamma/2^m = 30 B \gamma/2^m .$$

Here, we used that

$$\bigl|\hat f_{t,a,j}(x)\bigr| \le \bigl\|P_{a,j}^{(0)}\bigr\|_\infty + \sum_{m=1}^{M} \sum_{n=1}^{4^m} \sum_{k=1}^{\mathrm{card}\, \mathcal{Q}_a^{(m,n)}} \hat u_{t,a,j,k}^{(m,n)} \bigl\|Q_{a,k}^{(m,n)}\bigr\|_\infty\, \mathbb{I}_{x \in I_a^{(m,n)}} \le B + \sum_{m=1}^{M} \frac{3\gamma}{2^m} \le 4B . \tag{43}$$

Thus, setting $T_a^{(m,n)} = \sum_{t=1}^{T} \mathbb{I}_{x_t \in I_a^{(m,n)}}$, Theorem 9 yields:


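$$\sum_{t=1}^{T} \bigl(y_t - \hat f_{t,a,j}(x_t)\bigr)^2\, \mathbb{I}_{x_t \in I_a} \le \inf_{Q^{(m,n)}} \sum_{t=1}^{T} \Bigl(y_t - \Bigl(P_{a,j}^{(0)}(x_t) + \sum_{m=1}^{M} \sum_{n=1}^{4^m} Q^{(m,n)}(x_t)\, \mathbb{I}_{x_t \in I_a^{(m,n)}}\Bigr)\Bigr)^2\, \mathbb{I}_{x_t \in I_a} + 2 \sum_{m=1}^{M} \sum_{n=1}^{4^m} \frac{30 B \gamma}{2^m} \sqrt{T_a^{(m,n)} \log\bigl(\mathrm{card}\, \mathcal{Q}_a^{(m,n)}\bigr)}, \tag{44}$$

where the infimum is over all polynomial functions $Q^{(m,n)} \in \mathcal{Q}_a^{(m,n)}$ for every $m = 1, \dots, M$ and $n = 1, \dots, 4^m$. But, for each level $m = 1, \dots, M$, the point $x_t$ only falls into one interval $I_a^{(m,n)}$. Thus, $\sum_{n=1}^{4^m} T_a^{(m,n)} = T_a$, where $T_a = \sum_{t=1}^{T} \mathbb{I}_{x_t \in I_a}$ is the final local time of the $a$-th instance of $\mathcal{A}$. Therefore, using the concavity of the square root and applying Jensen's inequality, (44) entails

$$\sum_{t=1}^{T} \bigl(y_t - \hat f_{t,a,j}(x_t)\bigr)^2\, \mathbb{I}_{x_t \in I_a} \tag{45}$$
$$\le \inf_{Q^{(m,n)}} \sum_{t=1}^{T} \Bigl(y_t - \Bigl(P_{a,j}^{(0)}(x_t) + \sum_{m=1}^{M} \sum_{n=1}^{4^m} Q^{(m,n)}(x_t)\, \mathbb{I}_{x_t \in I_a^{(m,n)}}\Bigr)\Bigr)^2\, \mathbb{I}_{x_t \in I_a} + 60 B \gamma \sum_{m=1}^{M} 2^{-m} \sqrt{T_a 4^m \log\bigl(\mathrm{card}\, \mathcal{Q}_a^{(m,n)}\bigr)} . \tag{46}$$

Now, by the definitions of $\mathcal{Q}_a^{(m,n)}$ and $\mathcal{P}_a^{(m,n)}$ (see Equations (24) and (25)), we can see that

$$\mathrm{card}\, \mathcal{Q}_a^{(m,n)} \le \bigl(\mathrm{card}\, \mathcal{P}_a^{(m,n)}\bigr)^2 \le \bigl(\mathrm{card}\, \mathcal{Y}^{(m)}\bigr)^{2(q+1)} = \bigl(2^{m+1} B/\delta_y + 1\bigr)^{2(q+1)},$$

which yields

$$\sum_{m=1}^{M} 2^{-m} \sqrt{4^m \log\bigl(\mathrm{card}\, \mathcal{Q}_a^{(m,n)}\bigr)} \le \sum_{m=1}^{M} \sqrt{2(q+1) \log\bigl(2^{m+1} e B/\gamma + 1\bigr)} \le M \sqrt{2(q+1) \log\bigl(2^{M+1} e B/\gamma + 1\bigr)} .$$

Thus, using $M = \lceil \log_2(\gamma T/B) \rceil$, so that $2^M \gamma^{-1} \le 2T/B$, and combining the above inequality with (46), we have

$$\sum_{t=1}^{T} \bigl(y_t - \hat f_{t,a,j}(x_t)\bigr)^2\, \mathbb{I}_{x_t \in I_a} \tag{47}$$
$$\le \inf_{Q^{(m,n)}} \sum_{t=1}^{T} \Bigl(y_t - \Bigl(P_{a,j}^{(0)}(x_t) + \sum_{m=1}^{M} \sum_{n=1}^{4^m} Q^{(m,n)}(x_t)\, \mathbb{I}_{x_t \in I_a^{(m,n)}}\Bigr)\Bigr)^2\, \mathbb{I}_{x_t \in I_a} + 60 B \gamma \lceil \log_2(\gamma T/B) \rceil \sqrt{2(q+1)\, T_a \log(4eT + 1)} . \tag{48}$$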

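Subpart 2: high-scale aggregation.

Following the proof of Theorem 6, we apply EWA to the experts $\hat f_{t,a,j}$ for $j \in \{1, \dots, \mathrm{card}\, \mathcal{P}_a^{(0)}\}$ with tuning parameter $\eta = 1/(2(5B)^2) = 1/(50B^2)$ because $\hat f_{t,a,j} \in [-4B, 4B]$ (see (43)). From (48) and using $\mathrm{card}\, \mathcal{P}_a^{(0)} \le (2B/\delta_y + 1)^{q+1} = (2eB/\gamma + 1)^{q+1}$, we have

$$\sum_{t=1}^{T} \bigl(y_t - \hat f_{t,a}(x_t)\bigr)^2\, \mathbb{I}_{x_t \in I_a} \le \min_{1 \le j \le \mathrm{card}\, \mathcal{P}_a^{(0)}} \sum_{t=1}^{T} \bigl(y_t - \hat f_{t,a,j}(x_t)\bigr)^2\, \mathbb{I}_{x_t \in I_a} + \frac{\log\bigl(\mathrm{card}\, \mathcal{P}_a^{(0)}\bigr)}{\eta}$$
$$\le \inf_{P^{(0)},\, Q^{(m,n)}} \sum_{t=1}^{T} \Bigl(y_t - \Bigl(P^{(0)}(x_t) + \sum_{m=1}^{M} \sum_{n=1}^{4^m} Q^{(m,n)}(x_t)\, \mathbb{I}_{x_t \in I_a^{(m,n)}}\Bigr)\Bigr)^2\, \mathbb{I}_{x_t \in I_a} + 60 B \gamma \lceil \log_2(\gamma T/B) \rceil \sqrt{2(q+1)\, T_a \log(4eT + 1)} + 50 B^2 (q+1) \log(2eB/\gamma + 1), \tag{49}$$

where the infimum is over all functions $P^{(0)} \in \mathcal{P}_a^{(0)}$ and $Q^{(m,n)} \in \mathcal{Q}_a^{(m,n)}$, and where the second inequality follows from $\eta = 1/(50B^2)$.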

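Part 2: we sum the regrets over all subintervals $I_a$. By definition of $\hat f_t$, we have

$$\sum_{t=1}^{T} \bigl(y_t - \hat f_t(x_t)\bigr)^2 = \sum_{t=1}^{T} \Bigl(y_t - \sum_{a=1}^{1/\delta_x} \hat f_{t,a}(x_t)\, \mathbb{I}_{x_t \in I_a}\Bigr)^2 = \sum_{a=1}^{1/\delta_x} \sum_{t=1}^{T} \bigl(y_t - \hat f_{t,a}(x_t)\bigr)^2\, \mathbb{I}_{x_t \in I_a} .$$

Now, by definition of $\mathcal{F}^{(M)}$, summing (49) over all $a = 1, \dots, 1/\delta_x$ leads to

$$\sum_{t=1}^{T} \bigl(y_t - \hat f_t(x_t)\bigr)^2 \le \inf_{f \in \mathcal{F}^{(M)}} \sum_{t=1}^{T} \bigl(y_t - f(x_t)\bigr)^2 + 50 B^2 (q+1) \log(2eB/\gamma + 1)\, \delta_x^{-1} + 60 B \gamma \lceil \log_2(\gamma T/B) \rceil \sqrt{2(q+1) \log(4eT + 1)} \Bigl(\sum_{a=1}^{1/\delta_x} \sqrt{T_a}\Bigr) . \tag{50}$$

Then, using that $\sum_{a=1}^{1/\delta_x} T_a = T$, since at every round $t$, the point $x_t$ only falls into one subinterval $I_a$, and applying Jensen's inequality to the square root, we can see that

$$\sum_{a=1}^{1/\delta_x} \sqrt{T_a} \le \sqrt{T/\delta_x} .$$

Therefore, substituting in (50) and because $\delta_x = 2\bigl(q!\,\gamma/(2\lambda)\bigr)^{1/\beta}$, we have

$$\sum_{t=1}^{T} \bigl(y_t - \hat f_t(x_t)\bigr)^2 \le \inf_{f \in \mathcal{F}^{(M)}} \sum_{t=1}^{T} \bigl(y_t - f(x_t)\bigr)^2 + 25 B^2 (q+1) \log(2eB/\gamma + 1) \bigl(q!\,\gamma/(2\lambda)\bigr)^{-1/\beta} + 60 B \gamma \lceil \log_2(\gamma T/B) \rceil \sqrt{(q+1) \log(4eT + 1)\, T \bigl(q!\,\gamma/(2\lambda)\bigr)^{-1/\beta}} .$$

But, $\mathcal{F}^{(M)}$ is by Lemma 10 a $\gamma/2^M$-net of $\mathcal{F}$. Using that $M = \lceil \log_2(\gamma T/B) \rceil$ and following the proof of (15), it entails

$$\inf_{f \in \mathcal{F}^{(M)}} \sum_{t=1}^{T} \bigl(y_t - f(x_t)\bigr)^2 \le \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \bigl(y_t - f(x_t)\bigr)^2 + 4B^2 + \frac{B^2}{T} .$$

Finally, we have

$$\sum_{t=1}^{T} \bigl(y_t - \hat f_t(x_t)\bigr)^2 \le \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \bigl(y_t - f(x_t)\bigr)^2 + 25 B^2 (q+1) \log(2eB/\gamma + 1) \bigl(q!\,\gamma/(2\lambda)\bigr)^{-1/\beta} + 4B^2 + \frac{B^2}{T} + 60 B \gamma \lceil \log_2(\gamma T/B) \rceil \sqrt{(q+1) \log(4eT + 1)\, T \bigl(q!\,\gamma/(2\lambda)\bigr)^{-1/\beta}} . \tag{51}$$

The above regret bound grows roughly as (we omit logarithmic factors and small additive terms):

$$\gamma^{-1/\beta} + \gamma \sqrt{T \gamma^{-1/\beta}} .$$

Optimizing in $\gamma$ would yield $\gamma \approx T^{-\beta/(2\beta+1)}$ and a regret roughly of order $T^{1/(2\beta+1)}$. More rigorously, taking $\gamma = B T^{-\beta/(2\beta+1)}$ and substituting it in (51) concludes the proof.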
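The choice of $\gamma$ comes from balancing the two dominant terms of the bound. Ignoring constants and logarithmic factors, and writing the terms as $\gamma^{-1/\beta}$ and $\gamma \sqrt{T \gamma^{-1/\beta}}$ (a simplification made for this worked computation), one gets:

```latex
\gamma^{-1/\beta} \;=\; \gamma^{\,1 - \frac{1}{2\beta}}\, T^{1/2}
\;\iff\; \gamma^{-\frac{1}{\beta} - 1 + \frac{1}{2\beta}} \;=\; T^{1/2}
\;\iff\; \gamma^{-\frac{2\beta + 1}{2\beta}} \;=\; T^{1/2}
\;\iff\; \gamma \;=\; T^{-\frac{\beta}{2\beta + 1}},
\qquad\text{and then}\qquad
\gamma^{-1/\beta} \;=\; T^{\frac{1}{2\beta + 1}} .
```

For $\beta = 1$ this recovers $\gamma = T^{-1/3}$ and a regret of order $T^{1/3}$, in agreement with the Lipschitz case.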


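D.6. Proof of Lemma 12

Storage complexity. Fix a position $a \in \{1, \dots, 1/\delta_x\}$. At round $t \ge 1$, the Nested Chaining Algorithm for Hölder functions needs to store:

• the high-level weights $\hat w_{t,a,j}$ for every $j \in \{1, \dots, \mathrm{card}\, \mathcal{P}_a^{(0)}\}$;

• the low-level weights $\hat u_{t,a,j,k}^{(m,n)}$ for every $j \in \{1, \dots, \mathrm{card}\, \mathcal{P}_a^{(0)}\}$, every $m \in \{1, \dots, M\}$, every $n \in \{1, \dots, 4^m\}$, and every $k \in \{1, \dots, \mathrm{card}\, \mathcal{Q}_a^{(m,n)}\}$.

The complexity of the $a$-th instance of $\mathcal{A}$ is thus of order

$$\mathrm{card}\, \mathcal{P}_a^{(0)} \times M \times 4^M \times \mathrm{card}\, \mathcal{Q}_a^{(M,n)} .$$

Now, we bound each of these terms separately. First for $\gamma = B T^{-\beta/(2\beta+1)}$, we have

$$\mathrm{card}\, \mathcal{P}_a^{(0)} \le (2B/\delta_y + 1)^{q+1} = (2eB/\gamma + 1)^{q+1} = \bigl(2e T^{\beta/(2\beta+1)} + 1\bigr)^{q+1}$$

because $\delta_y = \gamma/e$. Second, using $M = \lceil \log_2(\gamma T/B) \rceil$, we can see that

$$4^M = (2^M)^2 \le (2\gamma T/B)^2 = 4\, T^{2(\beta+1)/(2\beta+1)},$$

and that

$$\mathrm{card}\, \mathcal{Q}_a^{(M,n)} \le \bigl(2^{M+1} e B/\gamma + 1\bigr)^{2(q+1)} \le (4eT + 1)^{2(q+1)} = O\bigl(T^{2(q+1)}\bigr) .$$

Putting all things together, the space-complexity of the $a$-th instance of $\mathcal{A}$ is of order

$$O\bigl(T^{2q+4+\beta(q-1)/(2\beta+1)} \log T\bigr) .$$

The whole storage complexity of the algorithm is thus of order

$$O\bigl(\delta_x^{-1}\, T^{2q+4+\beta(q-1)/(2\beta+1)} \log T\bigr) = O\bigl(T^{2q+4+(\beta(q-1)+1)/(2\beta+1)} \log T\bigr),$$

where we used that $\delta_x^{-1} = \frac{1}{2}\bigl(q!\,\gamma/(2\lambda)\bigr)^{-1/\beta} = O\bigl(T^{1/(2\beta+1)}\bigr)$.

Time complexity. At round $t \ge 1$, $x_t$ only falls into one subinterval $I_a$ and one subinterval $I_a^{(m,n)}$ for each level $m = 1, \dots, M$. It thus needs to update

• the weights $\hat w_{t,a,j}$ for a single position $a$ and for every $j \in \{1, \dots, \mathrm{card}\, \mathcal{P}_a^{(0)}\}$;

• for every level $m = 1, \dots, M$ the weights $\hat u_{t,a,j,k}^{(m,n)}$ for a single position $a$ and a single $n$, but for all $j \in \{1, \dots, \mathrm{card}\, \mathcal{P}_a^{(0)}\}$ and all $k \in \{1, \dots, \mathrm{card}\, \mathcal{Q}_a^{(m,n)}\}$.

The time-complexity is thus bounded by

$$O\bigl(\mathrm{card}\, \mathcal{P}_a^{(0)} \times M \times \mathrm{card}\, \mathcal{Q}_a^{(M,n)}\bigr) = O\bigl(T^{(q+1)(2+\beta/(2\beta+1))} \log T\bigr) .$$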